ABSTRACT


UTILITY  OF  AUTOMATIC  CLASSIFICATION  SYSTEMS  FOR 
INFORMATION  STORAGE  AND  RETRIEVAL 

by 

Barry Litofsky

Supervised  by  Professors  Noah  S.  Prywes 

and  David  Lefkovitz 

Large-scale, on-line information storage and retrieval
systems pose numerous problems above those encountered
by smaller systems. The more critical of these problems
involve: degree of automation, flexibility, browsability,
storage space, and retrieval time. A step toward the
solution of these problems is presented here along with
several demonstrations of feasibility and advantages.

The methodology on which this solution is based
is that of a posteriori automatic classification of the
document collection. Feasibility is demonstrated by
automatically classifying a file of 50,000 document
descriptions. The advantages of automatic classification
are demonstrated by establishing methods for measuring
the quality of classification systems and applying these
measures to a number of different classification
strategies. By indexing the 50,000 documents by two
independent methods, one manual and one automatic, it is
shown that these advantages are not dependent upon the
indexing method used.

It was found that among those automatic classification
algorithms studied, one particular algorithm, CLASFY,
consistently outperformed the others. In addition, it was
found that this algorithm produced classifications at
least as good, with respect to the measures established
in this dissertation, as the a priori, manual
classification system currently in use with the
aforementioned file.

The  actual  classification  schedules  produced  by 
CLASFY  in  classifying  a  file  of  almost  50,000  document 
descriptions  into  265  categories  are  included  as  an 
appendix  to  this  dissertation. 
CHAPTER 1


INTRODUCTION  TO  INFORMATION  STORAGE  AND  RETRIEVAL 


1.1 Magnitude of the Problem

The "information explosion" is here to stay and
anyone designing an information storage and retrieval system
must take cognizance of that fact. Not only is the current
publication rate high - about 350,000 scientific papers
per year [95] - but the growth in this rate is staggering.
De Solla Price [94] has plotted the number of scientific
journals (see Fig. 1-1) published each year from the oldest
surviving journal, Philosophical Transactions of the Royal
Society of London (1665), until today, when we are rapidly
approaching 100,000 journals.

In 1830, in order for scientists to keep up with
the increasing number of papers in the 300 journals of that
day, the first abstract journal was introduced. Since
then, as can be seen in Fig. 1-1, the growth of abstract
journals has essentially matched that of primary journals.
We are now long past the point of 300 abstract journals
being published yearly without a solution comparable to
the one found in 1830.

Thus, any solution to the problem of collecting
information and supplying desired information in the
proper amounts to the proper people at the proper time
must be able to handle this rapid growth in literature
as well as the substantial body of publications which
have been and which are being produced in each of the
technical fields today.

Figure 1-1

Growth of Journals and Abstract Journals
(from de Solla Price [94])

1.2  IS&R  Functions 

The objective of all IS&R systems should be to
serve a given community by supplying desired information
upon request. By the word "serve" is meant that the
system should function at the convenience of the user.
This implies simplicity of use, multiple modes of use to
suit each type of request, and accurate and prompt
responses. The degree to which a system meets this
objective is determined by how it is set up, how it is used,
and how much money is available for system design and
operation.

The first parameter that must be considered in
a system of this type is the size of the collection.
While there are a number of instances where small
collections might be useful (e.g., a library of a small
company in a very specialized area or a personal
information system [25,26,85]), considering the previous
section it can be seen that there must be a substantial
number of systems which, presently or in the near future,
are or will be concerned with large collections of
information. It is to these systems that this dissertation
is directed. Henceforth, all references to IS&R systems
will automatically imply systems with information
collections of, at the very least, tens of thousands of items.

IS&R systems can handle a variety of types of
information, ranging from individual scientific facts
to merchandise in an inventory to journal articles
and books in a library. For convenience and simplicity
the discussions and examples following will be restricted
to references to libraries with the understanding that
"information items" or "documents" refer to any of the
publications usually found in libraries.

1.2.1 Storage

The storage functions of an IS&R system are shown
in Fig. 1-2. The acquisition of documents for a collection
can be more than just deciding as to the pertinence of
a particular document to the collection. If documents
are acquired in the proper formats, later stages in the
storage process could be easily automated. Advantage
should be taken of advances in computerized typesetting
and optical character readers [11] to facilitate automatic
processing of documents.

The purpose of the indexing function is to obtain
a number of descriptors which act as a surrogate for the
document. These descriptors, or keywords, can be obtained
manually or automatically by computer analysis of the
document title, abstract or text. The indexing function
will be discussed in greater detail later in this paper.
Classification of documents refers to the grouping
of like documents into categories. The categories can be
set up independent of the documents (a priori) or after
all the documents have been indexed (a posteriori). In
the former case, document classification can be done
manually or automatically at the same time as indexing.
In the latter case it is done only after all the documents
have been indexed and will probably be an automatic
process. Most IS&R systems make no or very little use
of document classification.

Organizing the document file involves setting up
directories and deciding the order of the documents in
the file. Parts of this step are closely related to
document classification, particularly in an IS&R system
employing a posteriori automatic classification.

Document storage [11] and surrogate storage are
done separately because the documents themselves do not
have to be accessed as rapidly as their surrogates. Most
systems do not involve automatic document storage. However,
this should play an increasing role in large scale
automated IS&R systems. Surrogate storage can include
the above mentioned keyword surrogates, titles, authors,
abstracts and/or other items (called association terms
by Prywes [100]) deemed to be of use in deciding the
applicability of documents to retrieval requests.
1.2.2 Retrieval

The basic retrieval functions of an IS&R system
are shown in Fig. 1-3. The user formulates a request
and submits it to the system. This request could range
from a well thought out query to a vague notion of what
is desired. In fact, the user might not know what he is
looking for at all; he might just be browsing through
the collection to see if he can come up with anything
of interest. Also, the user might have need for every
item pertaining to his request, as in a patent search,
or he might be perfectly satisfied with one or a few such
items.

In order to satisfy all of the above needs an
IS&R system must be able to provide the user with the
option to refine his request after various stages of
preliminary processing [71]. According to a literature chemist
in a non-automated library [83], the ability to go back
to the user in order to refine the request is desirable
in all, very helpful in many, and absolutely essential
in some literature searches. Because of the rapidly
changing natures of thought processes and user needs
this request refinement should take place immediately
after the original request is submitted. This implies
that an effective IS&R system should include on-line
man-machine interaction [18,97,113].

Figure 1-3

Retrieval of Documents

Path A of Fig. 1-3 includes data such as
preliminary document counts, suggested query modifications,
classification hierarchy display, and other data which
might be available before accessing either the surrogate
or the document storages. In most of today's systems
this very important function is limited to (non-automated)
discussions with a professional librarian or to automated
document counts or, more often, it does not appear at all.

Path  B  essentially  exists  only  in  on-line  systems. 
This  assumes  that  request  refinement  done  after  a  day  or 
two's  wait  is  almost  identical  with  new  request  submission. 
The  advantages  of  Path  B  can  be  enhanced  if  upon  command 
surrogates  of  documents  similar  to  those  requested  can 
be  displayed.  This  is  relatively  easy  to  do  in  a  classified 
collection. 

Path C can only exist if the documents are stored
and can be retrieved or have parts of them displayed
rapidly. This feature is rare today but should be
incorporated in future systems.

1.3 Inadequacies of Current Systems

A significant portion of the library problem is
that most of today's "automated" IS&R systems are hardly
automated at all. On the storage side, indexing and
cataloging are the main areas which should be switched
from the human to the computer domain. At a recent
symposium on IS&R, Prywes [101] stated:
"In any one of the large libraries or
information centers there are thousands of
monographs and serials that are waiting to be
catalogued and indexed. These often lay unused
because of the dearth of competent
cataloguers and indexers, especially those
expert in particular subjects and languages.
The increased amount of material which is being
circulated soon may require substantial
increase in staff. Staff with this competence
is extremely scarce; low salaries discourage
young people from library work. For these
reasons the storage process tends to
constitute a serious bottleneck."

One is tempted to draw an analogy between
automation of libraries today and automation of the
telephone network some years ago. It is said that if the
dial did not replace telephone operators, all the women
in this country would have difficulty in handling today's
volume of telephone traffic.

Computer  processing  of  natural  language  text  for 
indexing,  and  automatic  classification  for  cataloging 
can  break  this  bottleneck. 

On the retrieval side, libraries and information
centers operate at low levels of effectiveness and are
called upon as information sources a relatively low
percentage of the time information is required [10].
One reason for this is the indirect route one must take
to use IS&R systems. On-line interactive systems could
solve much of this problem.

There are also problems with the automated portion
of IS&R systems. Present systems are generally
inefficient and are restricted to few modes of operation (i.e.,
none allow a reasonable degree of mechanized browsing).
Most systems in use or being proposed today are either
of the serial, inverted file or list-organized types.
Each of these systems has advantages and should be used
in certain situations. However, each has serious
drawbacks when used for retrieval by combination of document
descriptors (keywords).

The  main  difficulty  with  a  serial  file  is  the  time 
required  to  access  information.  Even  with  the  high  speed 
computers  of  the  foreseeable  future,  search  times  through 
serial  files  of  millions  of  documents  will  be  on  the  order 
of  many  minutes  or  even  hours.  This  leads  to  the  need  for 
batching  many  requests  and  eliminates  the  possibility  of 
on-line,  real-time  information  retrieval.* 

* There are methods to shorten serial searches, though not
by enough to upgrade serial systems to the real-time
domain. Fossum and Kaskey [48] inquire as to the
efficacy of certain of these methods. Their questions
are answered near the end of Chapter 5 of this paper.

In standard inverted file systems, the document
surrogates are stored in any order, usually by accession
number. Lists are maintained in a directory for accessing
this file. There is one list per keyword and the entries
in the lists are pointers to document surrogates.
Retrievals are performed by logical comparisons between
directory lists as specified by queries. Tables 1-1 and
1-2, modified from Prywes [102], give examples of
parameters to be encountered in typical IS&R systems.
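
The logical comparison of directory lists in an inverted file can be illustrated with a minimal sketch. It assumes that each document surrogate is simply a set of keywords filed under an accession number and that a query is a conjunction (logical AND) of keywords; the function names and the sample data are hypothetical and are not taken from any system described here.

```python
from collections import defaultdict


def build_inverted_directory(surrogates):
    """Build one directory list per keyword.

    surrogates: dict mapping accession number -> set of keywords.
    Returns a dict mapping keyword -> set of accession numbers.
    """
    directory = defaultdict(set)
    for accession_number, keywords in surrogates.items():
        for keyword in keywords:
            directory[keyword].add(accession_number)
    return directory


def retrieve_all(directory, query_keywords):
    """Answer a conjunctive query by intersecting directory lists."""
    lists = [directory.get(keyword, set()) for keyword in query_keywords]
    if not lists:
        return set()
    return set.intersection(*lists)


# Hypothetical three-document collection.
surrogates = {
    1: {"classification", "retrieval"},
    2: {"retrieval", "indexing"},
    3: {"classification", "indexing", "retrieval"},
}
directory = build_inverted_directory(surrogates)
hits = retrieve_all(directory, ["classification", "retrieval"])
```

Every directory list named in the query has to be brought into primary storage before the intersection can be formed, which is why the length of these lists matters so much in what follows.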

As may be seen in item I of Table 1-2, directory
size in an inverted file system may range from ten million
to close to a billion words. These must be stored on
relatively fast and expensive media (i.e., disks rather
than magnetic cards or strips) because of the necessity
of frequent access. In fact, the number of directory
words required per query (item J in Table 1-2) might be
so large compared to available high speed storage as to
require multiple accessions of the same lists in order to
process a query. A method which reduces these quantities
by an order of magnitude or more is shown in the next
chapter of this paper.

A recent study of mechanization in defense libraries
[67] points out some of the shortcomings in current systems.
All of the systems studied (27 in all) use either serial
files or inverted files. None use automatic indexing or
automatic classifications.

List-organized document retrieval systems chain
document surrogates together via keyword lists. In other
words, a search on a keyword involves jumping from document
to document which contain that keyword. Thus, most of the
directory is actually stored in the surrogate file itself.
Objections to this type of storage for large scale IS&R
systems center about length of the lists and speed of the
storage media required for reasonable search times.
Discussion and examples of list-organized systems are
available [48,76,99,103,104,105].


                                                      Number

A.  Item Records in File                         10^6  to  10^7

B.  Keywords assigned to an Item (Av.)           10    to  50

C.  Keywords in Vocabulary                       10^4  to  5 x 10^4

D.  Average Number of Items assigned
    to the same Keyword = (A x B)/C              10^3  to  10^4

E.  Keywords Specified in a Query (Av.)          10    to  50

F.  Items referenced by Keywords in a
    Query (Av.) = E x D                          10^4  to  5 x 10^5

Table 1-1

Illustration of Typical IS&R Systems


Directory:                                            Number

G.  Lists in Inverted File Directory;
    1 list per keyword = C                       10^4  to  5 x 10^4

H.  Accession Numbers per list in the
    Directory (Av.) = D                          10^3  to  10^4

I.  Words in Directory = G x H
    (Assume 1 computer word per
    Accession Number)                            10^7  to  5 x 10^8

Retrieval:

J.  Inverted File Directory Words
    Brought from Secondary to Primary
    Storage = F (All accession numbers
    in records that correspond to query
    keywords)                                    10^4  to  5 x 10^5

Table 1-2

Illustration of Magnitude of Directory
and Retrieval with Inverted File Methodology
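
The derived rows of Tables 1-1 and 1-2 follow from the formulas given in the tables themselves: D = (A x B)/C, F = E x D, and I = G x H with G = C and H = D. The short sketch below recomputes them for the low and high ends of the ranges. The function and argument names are illustrative assumptions; only the formulas and input values come from the tables.

```python
def inverted_file_magnitudes(items, keys_per_item, vocabulary, keys_per_query):
    """Recompute the derived rows of Tables 1-1 and 1-2."""
    # D = (A x B) / C: average number of items assigned to one keyword.
    items_per_keyword = items * keys_per_item / vocabulary
    # F = E x D: items referenced by the keywords of one query. This is
    # also item J, the directory words brought into primary storage.
    items_per_query = keys_per_query * items_per_keyword
    # I = G x H: directory words, with G = C lists of H = D entries each.
    directory_words = vocabulary * items_per_keyword
    return items_per_keyword, items_per_query, directory_words


# Low and high ends of the ranges in Table 1-1 (rows A, B, C, E).
low = inverted_file_magnitudes(10**6, 10, 10**4, 10)
high = inverted_file_magnitudes(10**7, 50, 5 * 10**4, 50)
```

Under these inputs the directory holds between ten million and five hundred million words, matching the range quoted for item I.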


CHAPTER  2 


CONCEPTS OF IS&R BASED ON AUTOMATIC CLASSIFICATION

2.1  Classification  Parameters 

The initial goal of classifying documents is to
group "like" documents together into categories. The
documents (or document surrogates) are then placed near
each other to facilitate retrieval. In a conventional
library the documents are placed on the same or adjoining
shelves. In an automated library the document surrogates
(plus room for additions) are placed into convenient
units of memory such as cylinders on a disk or magnetic
strip or card. These units which include categories of
information will be called cells (sometimes called
"buckets" [23,28,62]).

It is desirable to have a quantitative measure of
the "likeness" of documents. In a collection of documents
indexed with keywords, such a measure is supplied by the
number of keywords common to two documents. Extending
this notion, a measure of the quality of a classification
system is how well the classification algorithm minimizes
the average number of different keywords in a cell.
Definitions of parameters pertinent to this concept are
presented in Table 2-1.
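
Both measures can be stated in a few lines of code. The sketch below assumes each document is represented only by its set of keywords and each cell by a list of such documents; the function names and the four sample documents are hypothetical.

```python
def common_keywords(doc_a, doc_b):
    """Likeness of two documents: the number of keywords they share."""
    return len(set(doc_a) & set(doc_b))


def average_keys_per_cell(cells):
    """Average number of different keywords per cell (to be minimized).

    cells: list of cells, each a list of documents (sets of keywords).
    """
    distinct_counts = [len(set().union(*documents)) for documents in cells]
    return sum(distinct_counts) / len(cells)


# Four hypothetical documents drawn from two subject areas.
d1, d2 = {"a", "b"}, {"a", "b", "c"}
d3, d4 = {"x", "y"}, {"x", "y", "z"}

like_grouping = average_keys_per_cell([[d1, d2], [d3, d4]])
mixed_grouping = average_keys_per_cell([[d1, d3], [d2, d4]])
```

Placing like documents in the same cell gives an average of 3 different keywords per cell here, against 5 when the two subject areas are mixed, so the first grouping is the better classification by this measure.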

Symbol    Definition

Nv        Vocabulary size, total number of
          different keywords

Nd        Number of documents in the system

Nc        Number of cells in a given
          classification

Nkd       Average number of keys assigned per
          document (i.e., average depth of
          indexing)

Nkc       Average number of different keys
          per cell (this is the quantity to be
          minimized)

Nck       Average number of cells a given
          key is assigned to = (Nc x Nkc)/Nv

Ndc       Average number of documents per
          cell = Nd/Nc

Table 2-1

Definitions of Parameters


Figure 2-1

Classification Parameters
(log scale)


Figure 2-1 shows the bounds on Nkc, the average
number of keys per cell. It must be greater than or equal
to the larger of (a) Nkd, the average number of keys per
document and (b) Nv/Nc, the vocabulary size divided by
the number of cells. At the same time, Nkc must be less than
or equal to the smaller of (c) Nv, the vocabulary size
and (d) Nkd x Ndc, the number of keys per document
multiplied by the number of documents per cell. Thus the
average number of keys per cell for any given number of
cells must fall within the parallelogram of Figure 2-1.
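
The four bounds on the average number of keys per cell can be written directly as a small function. This is only a sketch of the inequalities stated above; the function name and the sample figures (a vocabulary of 10,000 keywords, one million documents, 10,000 cells, 10 keys per document) are illustrative assumptions.

```python
def nkc_bounds(n_v, n_d, n_c, n_kd):
    """Return (lower, upper) bounds on Nkc, the average keys per cell.

    n_v:  vocabulary size (Nv)
    n_d:  number of documents (Nd)
    n_c:  number of cells (Nc)
    n_kd: average number of keys per document (Nkd)
    """
    n_dc = n_d / n_c                   # Ndc, documents per cell
    lower = max(n_kd, n_v / n_c)       # larger of bounds (a) and (b)
    upper = min(n_v, n_kd * n_dc)      # smaller of bounds (c) and (d)
    return lower, upper


lower, upper = nkc_bounds(n_v=10**4, n_d=10**6, n_c=10**4, n_kd=10)
```

For these sample figures the average number of keys per cell must lie between 10 and 1,000.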

The diagonal dashed line represents the approximate
region of the expected plot of Nkc vs. Nc for a good
classification system. This expectation has been arrived
at by past experiments [3,77] and those described in this
paper. The actual path of this curve depends not only on
the classification algorithm but also upon the collection
itself. For example, if all keywords were unique, the
curve would follow the upper boundary regardless of the
classification algorithm used. Likewise, for any real
collection of documents, it would be very unlikely for
the curve to even approach the lower boundary.

An interesting point to consider is that Fig. 2-1
shows that serial and inverted files can be considered as
special cases of classification. A serial file would
have Nc = 1 and Nkc = Nv, thereby occurring at the upper
left corner of the diagram. Here, all the documents are in
one cell which is searched serially. An inverted file
appears at the opposite corner with Nc = Nd and Nkc = Nkd.
Now each cell contains only one document as in an inverted
file.


2.2 Advantages of A Posteriori, Hierarchical, Automatic
Classification Systems for On-Line Retrieval Systems

Most IS&R systems do not use classification at
all. Of those that do, many assign multiple categories
to each document and use these categories in place of
index terms (example: Universal Decimal Classification,
see Freeman [51,52] and Freeman and Atherton [53,54])
or use a unique category assigned to a document as an
additional index term. The full potential of
classification as an adjunct to indexing has not as yet been
approached. The following sections discuss what can be done
through the use of the proper type of classification.

2.2.1 Automatic Classification: Directory Size Reduction

The magnitude of the inverted file directory was
shown in the previous chapter. The use of automatic
classification can reduce the size of this directory by
more than an order of magnitude. This is done by forming
an inverted file directory on the cells, rather than on
the individual documents. It is true that once a cell
whose keys satisfy the query has been located, a search
must then be made in the cell for applicable documents.
However, this is not as much of a hardship as one might
think because of the following effects:


1)  Cell access time is generally much greater
    than transmission time. Therefore, it is not
    very costly to read the contents of an entire
    cell if one has to access the cell for one or
    more documents anyhow.

2)  Grouping logical document records together
    into larger physical records can provide
    significant storage savings.

3)  Considerable transmission and processing
    time (plus, of course, memory costs) are saved
    by manipulating much shorter directory lists.

4)  Memory accesses can be reduced. This is
    covered in Section 2.2.5 of this paper.
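
The two-stage retrieval described in this section (an inverted directory on the cells, followed by a search inside each cell that is located) might be sketched as follows. The data layout, the function names, and the sample cells are assumptions made for illustration; they do not describe any particular implementation.

```python
def build_key_to_cell(cells):
    """Build an inverted directory on cells instead of documents.

    cells: dict mapping cell id -> dict of (document id -> set of keywords).
    Returns a dict mapping keyword -> set of cell ids.
    """
    directory = {}
    for cell_id, documents in cells.items():
        for keywords in documents.values():
            for keyword in keywords:
                directory.setdefault(keyword, set()).add(cell_id)
    return directory


def retrieve(cells, directory, query_keywords):
    """Locate candidate cells, then search each one for matching documents."""
    query = set(query_keywords)
    if not query:
        return []
    # Stage 1: cells whose combined keys satisfy the conjunctive query.
    candidates = set.intersection(*(directory.get(k, set()) for k in query))
    # Stage 2: read each candidate cell and test its documents.
    hits = []
    for cell_id in sorted(candidates):
        for doc_id, keywords in cells[cell_id].items():
            if query <= keywords:
                hits.append(doc_id)
    return sorted(hits)


# Hypothetical collection of four documents in two cells.
cells = {
    "c1": {1: {"a", "b"}, 2: {"a", "c"}},
    "c2": {3: {"b", "c"}, 4: {"c", "d"}},
}
directory = build_key_to_cell(cells)
hits = retrieve(cells, directory, ["b", "c"])
directory_entries = sum(len(cell_ids) for cell_ids in directory.values())
```

In this small example the cell directory has 6 entries where a document-level inverted file would need 8. Cell c1 is read even though it holds no matching document, which is the in-cell search cost that the effects listed above are meant to offset.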

In order to demonstrate the order of magnitude
reduction in directory size, consider the sample conditions
presented in Section 1.3. The expected Nkc vs. Nc curves
are shown in Figure 2-2 along with their respective
parallelograms. The circles represent numbers of cells
chosen for this example. The actual number of cells in an
IS&R system will be decided upon by a trade-off between a
number of factors including:

1)  Cell size should be a convenient multiple or
    sub-multiple of a suitable storage unit.

2)  The fewer the cells the smaller the directory
    and the fewer the number of directory words
    which have to be brought into high speed
    storage for each request.

3)  The more cells there are the fewer number of
    documents per cell and hence the shorter the
    search through each cell.

4)  More cells mean fewer keys per cell. This
    increases the selectivity of each cell.

Figure 2-2

(Expected Nkc vs. Nc curves; horizontal axis: Number of Cells)
The appropriate numbers for the classified files
are given in Table 2-2. This is based on 10,000 and
50,000 cells (Nc) or 100 and 200 documents per cell (Ndc)
for the two cases respectively. The order of magnitude
reduction in directory size (10^7 to 5 x 10^8 compared with
10^6 to 2.5 x 10^7) and in words brought from secondary to
primary storage (10^4 to 5 x 10^5 compared with 10^3 to
2.5 x 10^4) is evident by comparison of Tables 1-2 and 2-2.
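
The figures quoted for the classified file can be checked with a few lines of arithmetic. The sketch assumes that the average number of cells per keyword is Nck = (Nc x Nkc)/Nv, a relation that is consistent with the values in Table 2-2; the function and variable names are illustrative.

```python
def classified_directory_magnitudes(n_v, n_c, n_kc, keys_per_query):
    """Directory figures for an inverted file built on cells."""
    n_ck = n_c * n_kc / n_v                    # cells per keyword (assumed relation)
    directory_words = n_v * n_ck               # one computer word per list entry
    words_per_query = n_ck * keys_per_query    # words brought to primary storage
    return n_ck, directory_words, words_per_query


# Low and high cases of Table 2-2.
low = classified_directory_magnitudes(10**4, 10**4, 10**2, 10)
high = classified_directory_magnitudes(5 * 10**4, 5 * 10**4, 5 * 10**2, 50)

# Reduction relative to the inverted file directory of Table 1-2.
reduction_low = 10**7 / low[1]
reduction_high = 5 * 10**8 / high[1]
```

With these inputs the directory shrinks by a factor of 10 in the low case and 20 in the high case.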

2.2.2 Automatic, A Posteriori: Flexible

All of the well-known classification systems of
today are a priori systems. In other words, the categories
and sub-categories were decided upon on the basis of some
"natural" divisions of knowledge and then the documents
were (and still are) placed into these categories. Some
problems with this traditional point of view are:

1)  Few areas of knowledge can be divided in a
    truly "natural" sense. For example, should
    biochemistry be a sub-division of biology or
    of chemistry? The answer to this might depend
    upon the rest of the collection
    (whether it is mainly biological or chemical
    in nature).

2)  Overlapping of disciplines is increasing.

3)  The classification schedules of a priori
    systems require significant effort to be
    kept up to date. The Universal Decimal
    Classification is an example of one with
    many outdated structures [54].

4)  New areas of knowledge must fit into the
    existing schedules. This results in highly
    artificial hierarchies of knowledge. Figure
    2-3 shows that in the Dewey decimal
    classification, electrical engineering is considered
    a subset of mechanical engineering. It is
    clear that this came about historically,
    and not because of today's view of the
    subdivision of knowledge.

5)  Specialized libraries make use of small portions
    of existing schedules. For instance, the
    average technical library which uses the Dewey
    decimal classification probably has 90 percent
    or more of its documents filed in the 500
    (pure science) or 600 (technology) divisions
    [?2]. One effect of this is very deep
    indexing, such as 621.3841361 for Communication
    Instruments [41].

6)  Existing classification trees usually
    involve at least ten and as many as several
    thousand alternatives at each decision node.
    Recent studies have shown that the optimum
    number of alternatives at each node is usually
    (depending on certain parameters) considerably
    less than ten [103,104,138,139].


Directory:                                            Number

Nv  = Lists in Automatic Classification
      Directory (1 list per term)                10^4  to  5 x 10^4

Nc  = Number of cells (Circles in
      Fig. 2-2)                                  10^4  to  5 x 10^4

Nkc = Average number of keywords
      per cell (Fig. 2-2)                        10^2  to  5 x 10^2

Nck = Average number of cells per
      keyword
    = Average number of words in a
      directory list (1 computer
      word per entry)                            10^2  to  5 x 10^2

Words in directory = Nv x Nck                    10^6  to  2.5 x 10^7

Retrieval:

Nck = Cell references per key                    10^2  to  5 x 10^2

Directory words brought from
Secondary to Primary Storage
= Nck x keys per query (item
E, Table 1-1)                                    10^3  to  2.5 x 10^4

Table 2-2

Illustration of Magnitude of Directory
and Retrieval with Automatic Classification

Our needs for information are changing, therefore
our classification schedules must be capable of changing.
In an automatic, a posteriori classification system,
the categories are decided upon after all the documents
have been indexed. In this way, the resulting system is
specifically designed for a particular collection at a
particular point in time. If there are significant
changes in the collection, the system can be automatically
reorganized to fully reflect the current status of the
collection.

A major objection to automatically derived
classification categories is that they might be different from
those decided upon by human beings. However, the quality
of a system should be measured by its convenience to the
user, and not by how the system is originated. Besides,
who knows that the human is right, and not the machine?
2.2.3 Hierarchical, On-Line: Browsable

In "The Conceptual Foundations of Information
Systems", Borko [18] notes:

"The user searches for items that are
interesting, original, or stimulating. No one can
find these for him; he must be able to browse
through the data himself. In a library, he wanders
among the shelves picking up documents that
strike his fancy. An automated information
system must provide similar capabilities."

The ability to browse through parts of a collection
should be an essential portion of every IS&R system.
There are many times when one has only a vague idea of the
type of document desired. Browsing can help channel
pseudo-random thoughts into a direct line towards the
information actually desired.

Effective browsing demands a hierarchical
classification system in order to enable one to start with broad
categories and work towards specifics. Automatic
classification can produce such hierarchical sets of
categories. In a priori systems, nodes are given names
and index numbers. However, in a posteriori systems the
node names are generated automatically and consist of
the set of keywords which appear in all the nodes directly
beneath (thinking of the hierarchy as an inverted tree)
the node in question. This resulting set of keywords
can be considered an "abstract" [9?] of the knowledge
contained beneath that node in the tree. If a set of
keywords is too large, humans or preferably automatic
processes can be employed to condense the set and provide
a suitable title for the node.
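
The automatic naming of nodes can be sketched as a recursive set intersection. The sketch assumes that a terminal node (cell) is described by the full set of keywords occurring in it, that every non-terminal node has at least one child, and that the tree is held in nested dictionaries; these representation choices and the sample keywords are hypothetical.

```python
def node_keywords(node):
    """Return the keyword set that names a node of the hierarchy.

    A terminal node carries its own 'keywords'. A non-terminal node is
    named by the keywords that appear in all of the nodes directly
    beneath it, i.e. the intersection of its children's keyword sets.
    """
    if "children" not in node:
        return set(node["keywords"])
    child_sets = [node_keywords(child) for child in node["children"]]
    return set.intersection(*child_sets)


# Hypothetical hierarchy: two cells under one branch, a third cell beside it.
cell_x = {"keywords": {"A", "B", "C"}}
cell_y = {"keywords": {"A", "B", "D"}}
cell_z = {"keywords": {"A", "E"}}
branch = {"children": [cell_x, cell_y]}
root = {"children": [branch, cell_z]}

branch_name = node_keywords(branch)
root_name = node_keywords(root)
```

Here the branch is named by the keywords A and B, and the root by A alone, so the names become broader toward the top of the tree.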

Naturally, automated browsing can only be
effective in on-line systems through man-machine interaction.
The user can enter a node through a conjunction
of keywords. The system would then display the nodes
beneath the original node, along with information such
as how many documents there are beneath each node or how
many documents contain each displayed keyword. When the
user selects a branch, the cycle repeats with the new
node. If desired, one could backtrack up the hierarchy
or jump to completely different sections of it. Once the
user has narrowed his search, he can demand retrieval of
some or all of the documents by specifying keywords and/or
categories.
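
One browsing cycle of this kind might look as follows. The
sketch is hypothetical throughout: the Node structure, the
function names, and the small sample hierarchy are assumptions
made for illustration, and only the document count beneath each
node is displayed.

```python
from dataclasses import dataclass, field


@dataclass
class Node:
    keywords: frozenset                              # generated node "name"
    children: list = field(default_factory=list)
    documents: list = field(default_factory=list)    # only cells hold documents


def document_count(node):
    """Number of documents stored in or beneath this node."""
    return len(node.documents) + sum(document_count(c) for c in node.children)


def display(node):
    """What the system might show for one step: each branch and its size."""
    return [(sorted(c.keywords), document_count(c)) for c in node.children]


def browse(root, choices):
    """Follow a sequence of branch selections; return the nodes visited."""
    path = [root]
    for index in choices:
        path.append(path[-1].children[index])
    return path


cell_a = Node(frozenset({"antenna"}), documents=[1, 2, 3])
cell_b = Node(frozenset({"noise"}), documents=[4])
cell_c = Node(frozenset({"doppler"}), documents=[5, 6])
signal = Node(frozenset({"signal"}), children=[cell_b, cell_c])
root = Node(frozenset({"radar"}), children=[cell_a, signal])

print(display(root))            # [(['antenna'], 3), (['signal'], 3)]
path = browse(root, [1, 0])     # select "signal", then "noise"
print(path[-1].documents)       # [4]
```

Because browse returns the whole path, backtracking up the
hierarchy amounts to dropping the last node of the path.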

Another way of browsing in a classified set of
documents is to start at the very bottom. Assume one
has a specific query in mind and upon submitting it to
the system, obtains only one document. If this is
insufficient one could broaden the search by requesting
the display of other documents in the category of the one
retrieved. Since these documents are close in content to
the original, they might also be satisfactory or their
keywords might suggest ways for the user to refine his
query in order to retrieve other documents of interest.

None of these modes of browsing could be utilized
by files with strict serial or inverted file organization.

2.2.4 Hierarchical: Further Directory Size Reduction

A small hierarchy is illustrated in Figure 2-4.
The keywords are represented by the letters A - H and the
nodes 2, 3, and 4 are assumed to be terminal nodes or cells.
The directory for the inverted file on cells, called the
key-to-cell table, has 15 entries. Figure 2-5 illustrates
the hierarchy effect presented in the previous section.
Here, the keywords A, B, and C have moved up to node 1
because all the nodes beneath node 1 contained them.
At the same time these keywords were deleted from the
lower nodes. Now, the key-to-node table has only 9 entries.

This example illustrates a further reduction in
directory size via use of a key-to-node table. Based on
experiments to be described later in this paper, this
reduction seems to be on the order of about 10-15 percent.
This reduction is applied to

(a) the amount of memory required for the directory,

(b) the number of directory words which must be
brought into main storage for each query,

and (c) the number of directory words which must be
processed for each query.

These benefits are obtained at the cost of increased
processing for each directory word. In the example of
Figure 2-5, keyword A points to node 1 and keyword G to
node 4. A query involving the conjunction of A and G
should indicate node 4. Thus the query decoding program
must realize that node 4 is under node 1 in the hierarchy.
Proper program design on suitable computers should minimize
this task.

                NODE 1

    NODE 2      NODE 3      NODE 4
      A           A           A
      B           B           B
      C           C           C
      D           F           D
      E                       G
                              H

KEY-TO-CELL TABLE
KEYS    A  B  C  D  E  F  G  H
CELLS   2  2  2  2  2  3  4  4
        3  3  3  4
        4  4  4

Figure 2-4
Inverted File on Cells

                NODE 1
                  A
                  B
                  C

    NODE 2      NODE 3      NODE 4
      D           F           D
      E                       G
                              H

KEY-TO-NODE TABLE
KEYS    A  B  C  D  E  F  G  H
CELLS   1  1  1  2  2  3  4  4
                 4

Figure 2-5
Inverted File on Nodes
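
The example of Figures 2-4 and 2-5 can be reproduced with a
short sketch. The function names (invert, hoist_common, entries,
resolve) and the parent_of map are assumptions introduced for
illustration; they do not describe the query decoding program of
any actual system. The sketch builds both directories, counts
their entries, and resolves the conjunction of A and G by
checking each cell together with the nodes above it.

```python
def invert(groups):
    """Build an inverted directory: keyword -> sorted list of groups holding it."""
    table = {}
    for group, keys in groups.items():
        for key in keys:
            table.setdefault(key, []).append(group)
    return {key: sorted(ids) for key, ids in sorted(table.items())}


def hoist_common(parent, cells):
    """Move the keywords present in every cell up to the parent node."""
    common = set.intersection(*(set(keys) for keys in cells.values()))
    nodes = {parent: common}
    for cell, keys in cells.items():
        nodes[cell] = set(keys) - common
    return nodes


def entries(table):
    """Total number of (keyword, group) pairs in a directory."""
    return sum(len(ids) for ids in table.values())


def resolve(query, table, parent_of, cells):
    """Cells satisfying a conjunction of keywords in a key-to-node table.

    A keyword is satisfied for a cell if it points at the cell itself
    or at any node above the cell in the hierarchy.
    """
    hits = []
    for cell in sorted(cells):
        lineage = {cell}
        node = cell
        while node in parent_of:          # walk up to the root
            node = parent_of[node]
            lineage.add(node)
        if all(lineage & set(table.get(key, [])) for key in query):
            hits.append(cell)
    return hits


# Cells 2, 3, and 4 of Figure 2-4, with their keywords.
cells = {2: "ABCDE", 3: "ABCF", 4: "ABCDGH"}
key_to_cell = invert(cells)

# Figure 2-5: A, B, and C move up to node 1.
nodes = hoist_common(1, cells)
key_to_node = invert(nodes)
parent_of = {2: 1, 3: 1, 4: 1}

print(entries(key_to_cell), entries(key_to_node))    # 15 9
print(resolve("AG", key_to_node, parent_of, cells))  # [4]
```

The walk up the hierarchy inside resolve is the extra processing
per directory word referred to above.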

2.2.5 On-Line Retrievals: Memory Accesses Reduction

Most mass storage devices have two components to
the time required to retrieve a record. The larger
component is the time required for the read mechanism
to approach the vicinity of the desired information (or
vice-versa). This is called the access time and is itself
made up of two components, motion access and latency
(usually averaging one-half revolution of the recording
media). The smaller component of the retrieval time is
the actual data transmission time. Typical characteristics
of some mass storage devices are shown in Table 2-3.
Comparing the total access times with the time required to
transmit 2000 bytes, one sees factors ranging from 7 for the
smaller devices up to 19 for the larger capacity memories.
This illustrates two points:

1) Once the access time has been "spent," it costs
relatively little more to read additional
data as long as another access time is not
involved.

2) An appreciable time savings can be made by
reducing the required number of memory
accesses.

Table 2-3
Characteristics of Typical Mass Storage Devices [13,33,63,106]
(Columns: Capacity (millions of bytes); Data Rate (thousands
of bytes/sec.); Transmission Time-2000 Bytes (milliseconds);
Average Motion Access Time (milliseconds).)

These points are very pertinent to on-line
systems because the lack of the ability to batch queries
leads to a large number of memory accesses. Automatic
classification takes advantage of item (1) by grouping
like documents into cells which are segments of memory
(tracks, cylinders, etc.) which do not require more
than one memory access. Thus, it costs little more in
time to retrieve an entire cell than it would to retrieve
a single document.

In addition, classification reduces the number of
memory accesses required (item (2) above) by the very
fact that the documents in a given cell are close to
each other in content. This "likeness" increases the
probability that multiple retrievals for a given query
would appear in the same cell. This in turn reduces the
number of cell accesses per query and hence the number
of memory accesses required.
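
The two points can be made concrete with a little arithmetic.
The device figures below are hypothetical, not taken from
Table 2-3: an access time of 70 milliseconds and a data rate of
200 bytes per millisecond are assumed, chosen only so that the
access time is 7 times the time to transmit 2000 bytes, the
lower factor quoted above.

```python
def retrieval_time_ms(accesses, bytes_read, access_ms=70.0, bytes_per_ms=200.0):
    """Total time = (number of accesses x access time) + transmission time."""
    return accesses * access_ms + bytes_read / bytes_per_ms


DOC_BYTES = 2000  # assumed size of one document record

one_document = retrieval_time_ms(1, DOC_BYTES)      # one access, one record
whole_cell = retrieval_time_ms(1, 3 * DOC_BYTES)    # one access, three records
scattered = retrieval_time_ms(3, 3 * DOC_BYTES)     # three accesses, three records

print(one_document, whole_cell, scattered)          # 80.0 100.0 240.0
```

Under these assumptions, reading a three-document cell in one
access costs 100 milliseconds against 80 for a single document,
while fetching the same three documents from three separate
locations costs 240.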

This reduction in memory accesses can be translated
into greater on-line capacity for a system.
Alternatively, it might speed operations up enough to
justify slower, but less costly (see Table 2-3) mass
storage devices.

2.3 Contributions of the Dissertation

The overall accomplishments of this dissertation
are:

1) Definition of some of the problems involved in
automated large-scale information storage
and retrieval systems.

2) Provision of a superior method of solution
for these problems.

3) Demonstration of the feasibility and advantages
of this solution.

The methodology on which this solution is based is
that of automatic classification of the document collection.
Feasibility is demonstrated by automatically classifying
a file of 50,000 document descriptions.* The advantages
of automatic classification are demonstrated by establishing
methods for measuring the quality of classification
systems and applying these measures to a number of different
classification strategies. By indexing the 50,000
documents by independent methods, it is shown that these
advantages are not dependent upon the indexing method
used.

The advantages demonstrated are:

1) Automaticity.

2) Flexibility.

3) Browsability.

4) Reduction in storage space.

5) Reduction in retrieval time.


CHAPTER  3 
LITERATURE  REVIEW 

3.1  Content  of  this  Chapter 

The main body of this chapter is concerned with
a critical review of prior publications in the area of
automatic classification theories and experiments. However,
because of the importance of indexing and its
close ties with classification, a descriptive section on
indexing is included. Another reason for including a
review of indexing efforts is that in preparing one of
the indexes for the experimental file used for this dissertation,
the author utilized a form of automatic
indexing (see Appendix A).

3.2 General Critique of Prior Experiments

There is one item which has been remarkably constant
in all research on automatic classification to date
(the present study and the work leading up to it excluded).
That is the lack of experiments on a significantly large
data base. The largest corpus used for automatic
classification experiments reported in the literature
contains about one thousand documents (see Section 3.3).
It is difficult to imagine how one could obtain a reasonable
number of significant categories, much less a reasonable
hierarchy, with so few documents (most experiments
were done on fewer than 400 documents). The current
experiments have clearly shown the need for large-scale
experiments. For example, it has been found (reported
in Chapter 5 of this paper) that in a certain aspect,
based on 2500 documents, an inverted file outperforms
a classified file. If the experiments stopped there
the results would have been in error, for the classified
file caught up to and far surpassed the performance of
the inverted file as the number of documents processed
was increased to almost 50,000.

Classification takes a number of different forms.
As stated previously, most classification systems are
a priori and have humans do the classifying. The newer
of these systems generally use categories as indexes,
thereby placing a document in more than one category.
These systems have come into being in order to overcome the
disadvantages of the Dewey decimal classification. However,
they have only partially succeeded and, in addition,
have generally increased the notational complexity.
Examples of such systems are the Colon Classification
[6,10?] and the Universal Decimal Classification
[49,50,52,53,54,115]. Reviews of these and other faceted
classification schemes can be found in Vickery [146]
and Taulbee [136].

In the realm of automatic classification, one can
identify two levels of automation. One is the automatic
placement of documents into a priori categories. The
other is the use of automated techniques to derive the
classification categories (a posteriori) and then place
the documents into these categories. Experiments have
been performed on each of these levels, though it is felt
that the latter is by far the more significant. The need
for a good a posteriori automatic classification technique
can be found in the literature. Altmann [2] notes that
the potential users of a system under design demanded
"browsability." In the absence of a decent automatic
system, a new, a priori classification system was designed
specifically for their collection. In designing a system
for the Air Force, Debons, et al. [40] noted the limitations
of all a priori classification systems and reluctantly
decided that of the available systems, strict
coordinate indexing, with no classification, was the route
to take. Lefkovitz, et al. [78,80,143] started using
a posteriori automatic classification in a large-scale
real-time chemical information system but abandoned it
in favor of inverted files due to the lack of data on the
quality of automatic classification on large files.

Very few experiments have been performed on
hierarchical a posteriori classification systems. A
notable exception is the work of Doyle [42,43]. As stated
previously, a hierarchy is required in order to have full
browsing capabilities. In a 1964 review of the state of
the art of IS&R systems, Arnovich, et al. [4] did not
consider the possibility of hierarchical, a posteriori
classification.

A major obstacle to the development of any type
of classification system is measurement of quality. Why
design a system if there is no way to tell if it is better
or worse than others? Until the present set of experiments,
most classification systems were measured by one or more
of the following methods:

(a) relevance assessments of documents in categories
with respect to a few search requests,

(b) were documents placed into the same categories
a human classifier would have placed them?,

(c) are a posteriori categories the same as
a priori, human-organized categories?,

(d) do the categories "look good"? (subjective
criterion).

Relevance assessment (a) of documents to requests should
not be used in testing classification techniques. This
use of relevance confuses classification with indexing.
With the use of this measure, one cannot separate the
quality of indexing from the quality of classifying and
is more likely to be measuring the former. Secondly,
the value of relevance and precision ratios being used in
any of the ways they are today is open to question. A
number of papers have been written pointing out the
shortcomings of the "pseudo-mathematical" use of these
ratios [34,35,45,71,121,133,135].

Items (b) and (c) above assume some degree of
a priori categories. In some experiments the categories
are set up a posteriori on part of the collection and used
to automatically classify the rest of the collection.
In other cases the categories are set up a priori by
humans. In either case, human judgment of "correct"
document classification or "correct" category content
is required. This is undesirable for two reasons.
Firstly, humans are not terribly consistent in indexing
or classifying documents [17,93,122]. Secondly, the goals
of automatic classification are to serve the user as
efficiently as possible and not to conform the system to
preconceived ideas of categories of knowledge. It might
be the case that these are one and the same, but this
judgment must await experimental verification before it
can be accepted.

Measure  (d)  above  can  be  dismissed  as  vague, 
inconsistent,  and  not  readily  amenable  to  verification. 

One attempt at an objective measure of classification
can be found in Doyle [43]. Here, a collection of
time-ordered items (daily work records and diaries) and
portions of documents was classified. The criterion used
was how well the categories could isolate continuous segments
of time and tie together parts of the same documents.
One could not do direct extrapolation of these results
to a more usual collection of documents, but it is a
start towards objective measurements. Unfortunately, the
collection consisted of only 100 items.

The measurement techniques used for this dissertation
are explained later. However, for comparison with
the above they will be briefly described here. These
measures can be applied to any classification scheme
(see Chapter 4 for five schemes to which they were applied)
without human judgment. As described in Section 2.1, a
count of the average number of discrete keywords per
category yields a measure (to be minimized) of the "likeness"
of documents in a category. In addition, if search
requests are applied to a system (165 requests were used
in this set of experiments), the number of categories
looked at as well as the number of documents searched in
these categories should be minimized. These measures can
be used to compare the quality of two or more classification
systems. In addition, the plot of keys per category versus
number of categories could be thought of as an absolute
measure, using the diagonal line of Fig. 2.1 as a reference
line. However, this must be tempered by the fact that
different collections probably have different relative
minima for the average keys per cell. Therefore, for best
measurements, experiments must be done on the same or
similar files.
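
A minimal sketch of these measures follows. It is not the
measurement program used in the experiments. The function
names and the four sample documents are hypothetical, and it is
assumed here that a category must be looked at whenever the
union of its keywords contains every term of a conjunctive
request.

```python
def avg_keywords_per_cell(cells):
    """Average number of discrete keywords per category (to be minimized)."""
    return sum(len(set().union(*docs)) for docs in cells.values()) / len(cells)


def search_cost(query, cells):
    """Return (categories looked at, documents searched) for one request."""
    query = set(query)
    looked_at = [docs for docs in cells.values() if query <= set().union(*docs)]
    return len(looked_at), sum(len(docs) for docs in looked_at)


# Four hypothetical documents, each a set of keywords.
d1 = {"radar", "antenna"}
d2 = {"radar", "signal"}
d3 = {"laser", "optics"}
d4 = {"laser", "beam"}

grouped = {1: [d1, d2], 2: [d3, d4]}   # like documents share a category
mixed = {1: [d1, d3], 2: [d2, d4]}     # unlike documents share a category

print(avg_keywords_per_cell(grouped), avg_keywords_per_cell(mixed))   # 3.0 4.0
print(search_cost({"radar"}, grouped), search_cost({"radar"}, mixed)) # (1, 2) (2, 4)
```

The grouping that places like documents together scores lower
on all three counts, and no human judgment enters the
comparison.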
One of the reasons generally given for using few
documents in an experiment is that automatic classification
of large numbers of documents requires much time and/or
high-speed memory (which can be converted into time via
use of secondary storage) and that one or both of these
factors are not available or cost too much to justify.
Experiments are then performed with few documents but
with full realization that actual systems would incur
great expense in trying to classify large numbers of
documents by the experimental methods [73]. In most
automatic classification systems being considered today
(clustering types), classification time is proportional
to the square, or even the cube, of the number of documents
in the system. This is because of the need to
compare every document (or partial category) with every
other document (or partial category) or to generate and
manipulate matrixes whose sides are proportional to the
number of documents and/or the number of discrete keywords
in the system (see Doyle [44], "Breaking the Cost Barrier
in Automatic Classification"). This means that the cost
of classification per document goes up at least linearly
with the number of documents. Considering collections
numbering in the millions of documents, it is evident
that systems with the above characteristics are unacceptable.

There are two systems which are known to break
this effect. One is proposed (discussed on page 50)
by Doyle [44] and the other is presented in this dissertation.
Both are hierarchic, a posteriori classification
systems which work from the top (all the documents) down
(to the categories). In both, the time proportionality
factor ($N_d$ documents) is approximately $N_d \log N_d$, where the
logarithmic base is the number of branches at each node of
the hierarchy.
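
The difference between the two growth rates can be tabulated
directly. The sketch below is illustrative only: the function
names are introduced here, the collection sizes are arbitrary,
and a hypothetical 10 branches per node is assumed for the
logarithmic base. The square-law case is represented by the
count of all document pairs.

```python
import math


def pairwise_comparisons(n_docs):
    """Comparisons needed if every document is compared with every other."""
    return n_docs * (n_docs - 1) // 2


def top_down_steps(n_docs, branches):
    """Approximate work for a top-down hierarchy: N * log_b(N)."""
    return n_docs * math.log(n_docs) / math.log(branches)


for n_docs in (1_000, 50_000, 1_000_000):
    quadratic = pairwise_comparisons(n_docs)
    top_down = top_down_steps(n_docs, branches=10)
    print(f"{n_docs:>9,} docs: per-document cost "
          f"{quadratic / n_docs:>10,.1f} (pairwise) vs "
          f"{top_down / n_docs:>4.1f} (top-down)")
```

The per-document cost of pairwise comparison grows in step with
the collection, while the top-down figure grows only with the
depth of the hierarchy.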

The cost involved with processing millions of
documents eliminates the application of more sophisticated
(i.e., theoretical) and/or deterministic techniques
towards the problems of automatic classification. Few
would doubt the possibilities of translating the problems
of automatic classification into the realm of automata
or linear programming theory. However, the practicalities
of millions of different documents with tens of thousands
of discrete keywords and tens of millions of keyword
appearances eliminate the use of these otherwise
attractive-looking devices.

Before the examination of particular automatic
classification schemes, a word should be said about another
type of classification. This is classifying, grouping, or
relating index terms to aid retrieval but not to be used
to group documents. These schemes could have merit in
certain cases, but are not of direct interest here. Examples
can be found in [45,56,74,75,79,114,130,154].
The most extensive of these studies was done by Zimmerman
[154] who automatically classified the keywords in 25000
document descriptions into Exclusive and Inclusive groups.
However, possibly due to the use of only 85 keywords, his
results are not very encouraging for the future of this
type of classification.

3.3 Automatic Classification

Table 3-1 (a-d) summarizes some significant facts
about previous automatic classification experiments. This
table and the cited references (plus the following discussion)
are presented in lieu of a detailed description of
each classification algorithm and experiment. Many reviews
are available which describe the actual algorithms used to
set up the categories or the statistical processes used
to classify the documents [16,59,72,73,136,147], but none
summarize the vital statistics for more than a very few
experiments. Other, more general, reviews of automatic
classification are also available [19,128,129] imbedded in
reviews of broader areas of interest.

Lance and Williams [72,73] divide a posteriori
classification strategies into hierarchical systems and
clustering systems. They further subdivide hierarchical
systems into agglomerative methods and divisive methods.
In agglomerative methods (the only type they consider) the
hierarchy is formed by combining documents, groups of
documents, and groups of groups of documents until all
documents are in one large group: the entire collection
itself. The hierarchy being thus formed, all that remains
is to select some criterion, such as category size, by
which one cuts off the bottom of the hierarchy, thereby
producing categories. Experiments using such a method were
performed by Doyle [42,43] using the Ward grouping program
[148]. Prywes [97,98,99] has also devised a system of
this type. Wolfberg [153] has done some preliminary work
on this algorithm, but because of computational difficulties
with large collections, no large-scale experiments have yet
been performed.

Divisive techniques have long been thought the
realm of philosophers and other designers of a priori breakdowns
of knowledge. With this technique, one starts with
the entire collection and successively subdivides it until
appropriately sized categories are obtained. Doyle [44]
has proposed a system of this type (see Dattola [39] for
preliminary experiments). However, this system requires
some a priori categories as a starting point at each level
of classification. Whether or not this can be overcome
(it probably can) remains to be seen.

The algorithm used (among others) in this paper,
called "CLASFY", is also of the hierarchical divisive
type, but is of a self-starting variety. Previous
experiments performed during the development of CLASFY
are summarized in Table 3-2. For completeness, the present
experiments using CLASFY are summarized in Table 3-3.

Clustering systems involve a wide variety of
classification techniques which seek to group index terms
or documents with high association factors together
into "clusters", "clumps", or "factors" without trying to
obtain a hierarchy. Some examples of experiments with
some of these methods are shown in Table 3-1. Most of
these methods require matrix manipulation, though it
should be added that the precise manner of these manipulations
varies widely with the particular scheme used.
Another scheme of this general type is latent class analysis,
proposed by Baker [7,8]. This method can utilize
correlations of triplets or larger sets of index terms as
well as pairs of terms as utilized by the other methods.
There have been no reports of experiments using latent
class analysis.

Price and Schiminovich [96] have recently
manually simulated automatic clustering of 240 physics
documents using bibliographic citations instead of keywords
as the basis for the clustering algorithm. Even though
this technique might be satisfactory under certain limited
conditions, because of the variability of authors in citing
references it is doubtful that this could be used as a
general method. The quality criterion used in the above
experiment was equivalent to (c) of Section 3.2.


Source: Angell [3]; Lefkovitz and Angell [77]
Year of Publication: 1966; 1966
Related Pubs.: [76]; [31]
NOTE 5. Produced by random number generation.

Summary of Present Experiments


Some other papers of interest to automatic classification
are: O'Connor [91] (an "old" article on classification
designed for peek-a-boo cards), Chien and Preparata
[28] (file organization after automatic classification),
Soergel [124] (highly theoretical - doubtful practical
application), Nagy [87] (application of various automatic
classification techniques to pattern recognition), and
Sokal [125] (numerical taxonomy, or, the use of automatic
techniques in biological classification).

3.4  Indexing  and  Automatic  Indexing 

Many more experiments have been carried out on
indexing than on classification for automated IS&R systems.
In addition, the indexing experiments were generally
designed to be much more effective in selecting good
indexing systems than were the classification experiments
in selecting good classification systems. For example,
out of 26 index evaluation projects reported in tabular
form by Bourne [22], 21 of them involved comparative
evaluations of from two to as many as about 15 different
indexing schemes, some automatic and some not. Stevens
[128] presents a good state-of-the-art report on automatic
indexing and related problems as of early 1965. More
recent reviews of automatic indexing are also available
[9,19,108]. Henderson [59] recently gathered informative
abstracts of a number of papers dealing with IS&R systems
with particular emphasis on those dealing with indexing.

Once again, a significant deficiency in most
automatic indexing experiments is the small collections
upon which conclusions are based. A notable exception to
this is the set of experiments of Giuliano and Jones [55],
where collections of 10,000 documents were used. Other
deficiencies of automatic indexing experiments are similar to
some of those (particularly the notion of relevance)
described in Section 3.2 for automatic classification.

Since this paper is not directly concerned with
indexing, specific indexing projects will not be reviewed.
Instead, a few words will be said about comparative
indexing experiments, especially those pertaining to the
type of automatic indexing described in Appendix A.

Research on comparative indexing has progressed
from comparing manual indexes (such as Cleverdon, et al.
[30]) to more recent work on comparing manual indexing
with various forms of automatic indexing. A detailed
analysis of various modes of automatic indexing is being
performed by project SMART under the guidance of Salton
[116,117,118,120]. Some of the items under study are
document length (title, abstracts vs. full text),
matching functions and term weights, language normalization
(delete suffix "s", word stems, full thesaurus, etc.),
manual indexing, and synonym and phrase recognition.
Salton has found that, in general, for his test collection,
detailed manual indexing (over 30 terms per document) is
slightly better than the automatic indexing techniques
used, indexing on abstracts is better than on title alone,
and use of a thesaurus involving synonym recognition is
more effective than word stems, which is slightly better than
deleting only suffix "s" and common words.

Other experiments have been performed comparing
indexing by title words vs. abstract words [68,109] and
title words (usually KWIC, keyword-in-context) vs. manual
indexes [1,12,24,29,69]. The general consensus is that
titles of technical articles are sufficiently descriptive
to be used for automatic indexing but that abstracts would
probably serve somewhat better. These results (including
those of project SMART) were used in semi-automatically
indexing almost 50,000 documents (see Appendix A) used in
the current experiments.


CHAPTER  4 


EXPERIMENTAL  CLASSIFICATION  STRATEGIES 

4.1  Introduction 

This chapter contains descriptions of the various
classification algorithms used in these experiments. The
classification experiments themselves are described in the
next chapter.

Five different algorithms were studied. Three of
these are a posteriori, one a priori and one random, for
comparison. Of the three a posteriori systems, only one
is basically of a hierarchical nature (CLASFY). This
system was studied with numerous variations of parameters
and, as shall be seen, input orderings. It was found to
be the best among those investigated. Because of the time
required for processing of the large files, much of the
parameter optimization was done on the small keyword file.
In actual system operation the time required to classify
a large file (assuming processing time increases no faster
than $N_d \log N_d$ - see Chapter 3) is not so important because
the classification would be performed once and not repeated
until a substantial number of new documents have entered
the system.

In all of the systems studied, a document acquired
between classifications or memory reorganizations
would be placed into a cell based on the original concept
of the particular classification system in question.

All CLASFY processing was done on an IBM 7040
computer in MAP. All other processing was done on an IBM
360/65 computer in PL/I.

4.2  A  Hierarchical  Classification  Algorithm  (CLASFY) 

The original version of the primary classification
algorithm under consideration here was conceived by Dr. David
Lefkovitz [76] and was programmed by Angell [3] in MAP on
an IBM 7040 computer. Since then it has been used to automatically
classify a document file for the Air Force [31,77].
In addition, it was used for a while in an experimental chemical
IS&R system [78,80,143] but was discontinued because
of lack of information about its performance on large files.
However, until now, no means was available for measuring
the quality of this algorithm. In the course of the current
work, the above algorithm was improved and evaluated, and
then compared with other classification systems.

4.2.1  Description  of  the  Algorithm 

CLASFY is a hierarchical classification algorithm
of the divisive type. That is, one starts with the entire
collection and successively partitions it into smaller and
smaller groups until a group size criterion is met. These
final groups are called cells and are the actual categories
into which the documents are placed.

Each node is treated independently of the others.
In other words, the algorithm is first applied to the entire
collection. This results in partitioning the collection
into N groups, represented by nodes in the classification
tree. The selection of an appropriate node stratification
number, N (i.e., number of branches out of a node or number
of groups into which each node is partitioned) is discussed
in Section 4.2.4.1. The algorithm is then reapplied to
each of the resulting N nodes, yielding N additional groups
(assuming a constant stratification number) for each of
these nodes. By now the collection has been divided into
$N^2$ mutually exclusive groups of documents. This is
continued until a size criterion, such as number of documents
per group or number of computer words per group, is met for
each resulting group. Because collections are not
completely homogeneous, the size criterion will generally be
met at different tree levels for different portions of the
classification tree. Therefore, in general, the resulting
tree will not be a "regular" tree, terminating throughout
at the same level.
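
The divisive recursion described above can be sketched in Python. This is only an illustration under stated assumptions, not the CLASFY program itself: the names classify, round_robin and cell_depths are hypothetical, the size criterion is taken to be a maximum number of documents per cell, and round_robin is a trivial stand-in for the real partitioning step. As a simplification, the sketch stops splitting when a partition yields fewer than two non-empty groups.

```python
def classify(docs, n, max_cell_size, partition):
    """Divisively split `docs` until every group meets the size criterion.

    Returns a list of documents for a cell, or a dict mapping branch
    number (1..n) to a subtree for an interior node.
    """
    if len(docs) <= max_cell_size:
        return list(docs)  # size criterion met: this group is a cell
    groups = [g for g in partition(docs, n) if g]  # ignore empty groups
    if len(groups) < 2:
        return list(docs)  # simplification: stop if no split was achieved
    return {k: classify(g, n, max_cell_size, partition)
            for k, g in enumerate(groups, start=1)}


def round_robin(docs, n):
    """Hypothetical stand-in for the real partitioning step."""
    return [docs[k::n] for k in range(n)]


def cell_depths(tree, depth=1):
    """Level of each cell, counting the top node as level 1."""
    if isinstance(tree, list):
        return [depth]
    return [d for sub in tree.values() for d in cell_depths(sub, depth + 1)]


tree = classify(list(range(5)), n=2, max_cell_size=2, partition=round_robin)
print(tree)               # nested branches ending in cells
print(cell_depths(tree))  # cells need not all lie at the same level
```

With five documents, N = 2 and at most two documents per cell, the cells of this toy tree end on different levels, which is the "irregular" tree referred to in the text.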

Each node is represented by the keyword surrogates
of the documents at that node and by the keyword
vocabulary made up of the union of the keywords of these
surrogates. The algorithm (operating at any given node) is
based on three principles.

1)  The keyword vocabulary is to be partitioned
    into some number of groups such that every
    document description (of the documents at that
    node) is represented in at least one of the
    resulting groups.

2)  The groups should be constructed such that
    each document description appears in as few
    groups as possible.

3)  The number of keywords in each group should be
    roughly equal.

It should be noted that if principle 3) were not
included, the solution to 1) and 2) would be to place all
documents into one group. Of course, this would not
result in any partitioning. The word "roughly" in
principle 3) is implemented by defining a sensitivity factor, E.
An attempt is made to allow the number of keywords in
each group to differ from those of any other group by no
more than E (this cannot always be done). Further
discussion of the sensitivity factor can be found in Section
4.2.4.2.

Even though a document description may appear in
more than one resulting group, it is assigned to only one
of these groups (see the actual algorithm below).

The algorithm itself consists of a three-pass
process. That is, at each node, the keyword surrogates of
the documents at that node are linearly scanned three times
(actually, the third scan does not look at all the documents
and, in fact, at times includes few or no documents).
Looking at the entire tree, this means that the entire
collection of documents is scanned linearly in three
passes at each tree level. Since the time required for
linear scans varies in proportion to the number of
documents and the number of levels varies with the logarithm
of the number of documents, it can be seen where $N_d \log N_d$
(where the logarithmic base is N, the stratification
number) was arrived at for the proportionality factor for
classification time (see Section 3.2).

In the following description of the classification
algorithm for CLASFY, N represents the stratification
number, E the sensitivity factor, and D a particular
document description (i.e., keyword surrogate). Figure
4-1 presents a macro-flowchart of CLASFY. For more detailed
flowcharts of the original version of CLASFY, see Angell
[3].

Figure 4-1

Macro-Flowchart of CLASFY

PASS 1

This pass partitions the keyword vocabulary of a
node into N non-exclusive groups by adding the keywords
of each document, one at a time, to one of the N groups.

1)  Number the resulting groups 1, 2, 3, ..., N.
    Initially, all groups have no keywords. The file
    is positioned at the beginning.

2)  The next description, D, is read. Denote the
    group which contains the most keywords of D as
    group i. If there are two or more such groups,
    denote the one with the fewest distinct
    keywords as group i. If there are still two
    or more groups, arbitrarily select the one
    with the lowest group number as group i.

3)  Let the number of keywords in group i be
    denoted $n_i$ and the number of keywords of D not
    in group i (i.e., the number of keywords which
    would have to be added to group i if D were
    included in that group) be denoted $a_i$. The
    following inequality is tested for j = 1, 2, ..., N;
    j ≠ i:

    $(n_i + a_i) \le (n_j + a_j) + E.$

    If true, that is, if the new size of group i
    is no more than E greater than the potential
    new size of any other group, the keywords of D
    are added (union) to the keywords of group i.
    Otherwise, set i = j (that j for which the
    above expression is not true) and continue the
    above test on the remainder of the groups.

4)  If this is the last document, return to the
    beginning of the file and go on to Pass 2. If
    not, go to item 2) of this pass.

It can be seen that this process guarantees that
the keywords of every document description are included in
at least one group. However, no documents have been
assigned to any group.

PASS 2

This pass assigns those documents whose
descriptions appear in only one group to that specific group.
Documents with descriptions in more than one group are
deferred for Pass 3 processing.

1)  The next description, D, is read. If all the
    keywords of D appear in only one group, those
    keywords (of D) are flagged in that group and
    D is assigned to that group. The flagged
    keywords are essential because no other group
    contains all the keywords of D.

2)  If the keywords of D appear in more than one
    group, no keywords are flagged and D is written
    on an intermediate file for Pass 3 processing.
    This indicates that a redundancy exists.

3)  If this is the last document, position the
    intermediate file at the beginning and go on
    to Pass 3. If not, go to item 1) of this pass.

At this point, some of the documents have been
assigned groups and some of the keywords in the groups have
been flagged.

PASS 3

This pass over the redundant descriptions from Pass 2
attempts to minimize description redundancies among the
groups of keywords.

1)  The next description, D, on the intermediate
    file is read.

2)  If the keywords of D are all flagged within at
    least one group, assign D to the first such
    group encountered (other methods can also be
    used to select the group, if more than one,
    to which D should be assigned). If
    the keywords of D are not all flagged within
    any group, consider the groups which contain
    all the keywords of D. Of these, determine
    which one has the most keywords of D flagged
    (if more than one, arbitrarily choose the one
    with the lowest group number). Assign D to
    that group and flag the remainder of the
    keywords of D in that group.

3)  If this is the last document, processing is
    complete for this node. If not, go to item
    1) of this pass.

All documents have now been assigned groups. The
unflagged keywords in each group are redundant and are not
contained in any document description in that group. These
new nodes are now ready for repartitioning, if desired.
When the cell criterion has been met by a particular group,
it is considered to be a cell and the keys associated with
that cell are the flagged keys of the group.
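
The three passes above can be sketched in Python as follows. This is a minimal illustration and not the original MAP program. It assumes that each description is a Python set of integer keywords, that groups are numbered from 0 rather than 1, and that an in-memory list stands in for the intermediate file. The function names (pass1, pass2, pass3, partition_node) are hypothetical.

```python
def pass1(docs, n, e):
    """Grow n keyword groups so that each description lies wholly in some group."""
    groups = [set() for _ in range(n)]
    for d in docs:
        # Item 2: most keywords of d in common; ties go to the group with the
        # fewest keywords, then to the lowest group number.
        i = min(range(n), key=lambda g: (-len(d & groups[g]), len(groups[g]), g))
        # Item 3: len(groups[g] | d) is n_g + a_g, the potential new size.
        for j in range(n):
            if j != i and len(groups[i] | d) > len(groups[j] | d) + e:
                i = j  # inequality failed: switch to group j and keep testing
        groups[i] |= d
    return groups


def pass2(docs, groups):
    """Assign descriptions contained in exactly one group; defer the rest."""
    assigned, deferred = {}, []
    flagged = [set() for _ in groups]
    for k, d in enumerate(docs):
        holders = [g for g, kw in enumerate(groups) if d <= kw]
        if len(holders) == 1:
            flagged[holders[0]] |= d
            assigned[k] = holders[0]
        else:
            deferred.append(k)  # stands in for the intermediate file
    return assigned, flagged, deferred


def pass3(docs, groups, assigned, flagged, deferred):
    """Place each deferred description where the fewest new flags are needed."""
    for k in deferred:
        d = docs[k]
        g = next((g for g in range(len(groups)) if d <= flagged[g]), None)
        if g is None:
            holders = [h for h in range(len(groups)) if d <= groups[h]]
            # most keywords of d already flagged; ties go to the lowest number
            g = max(holders, key=lambda h: (len(d & flagged[h]), -h))
            flagged[g] |= d
        assigned[k] = g
    return assigned, flagged


def partition_node(docs, n, e):
    """Run the three passes; return document indices and flagged keys per group."""
    groups = pass1(docs, n, e)
    assigned, flagged, deferred = pass2(docs, groups)
    assigned, flagged = pass3(docs, groups, assigned, flagged, deferred)
    members = [[k for k in range(len(docs)) if assigned[k] == g] for g in range(n)]
    return members, flagged


docs = [{1, 2}, {3, 4}, {1, 3}, {2}, {3}]
members, flagged = partition_node(docs, n=2, e=0)
print(members)  # document indices per group
print(flagged)  # flagged (non-redundant) keywords per group
```

In this toy file the last description, {3}, is contained in both keyword groups after Pass 1, so it is deferred in Pass 2 and placed by Pass 3. The flagged sets returned for each group are the keys that would be associated with the group if it became a cell.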

Figure 4-2 shows part of a classification via
CLASFY of the large keyword file. Pertinent parts of the
hierarchy are also shown. This particular node (number 5)
contains 4267 documents (items), with N = 5, E = 75, and C =
460 documents. It should be noted that group 5 passes the
size criterion (393 documents) and therefore is a terminal
node (node 26), or cell.

4.2.2  Classification Example

The difficulty with presenting examples of
classification is that systems such as CLASFY were designed for
large numbers of documents. Their use on small collections
will often produce poor classifications. With this in
mind, the following example was chosen to illustrate aspects
of the various classification algorithms, and not because
"good" classifications will be produced.

A file of 14 document descriptions is displayed in
Figure 4-3. The keywords of the documents were ordered and
replaced by rank numbers as described in Section A.2.
These integer keywords will be used as the document
descriptors. This file will be classified with N = 2 and E
= 0. The classification will be carried out on two levels,
disregarding any cell criteria.

Figure 4-4 shows the three-pass partitioning of the
top node (14 documents). The keywords are shown in the
order that they were added to the groups. Note that in
Pass 3, documents D1 and D10 were added to group 1 and
keywords 8 and 9 were found to be redundant and were
deleted from group 2.

Document Name    Keywords    Integer Keywords

D1               A B         2 3
D2               A C D       2 4 5
D3               A C F       2 4 6
D4               A E F       2 7 6
D5               B D K       3 5 1
D6               B C         3 4
D7               D M N       5 10 11
D8               E K         7 1
D9               F G H       6 12 13
D10              I J         8 9
D11              J K         9 1
D12              I K         8 1
D13              K L         1 14
D14              M N O       10 11 15

Figure 4-3

A Sample File of Document Descriptions

Each of the two groups of documents produced by
the partitioning (see Pass 3, Fig. 4-4) is now successively
partitioned to form a third level. The resulting
three-level binary tree is shown in Figure 4-5. The numbers in
the boxes represent the keywords of the group which formed
that node. This tree (but not the document groups) will
be modified later on in this chapter in order to facilitate
browsing.

Group I:    D1 - 2 3;  D5 - 3 5 1;  D10 - 8 9;  D11 - 9 1;
            D12 - 8 1;  D13 - 1 14
Group II:   D4 - 2 7 6;  D8 - 7 1;  D9 - 6 12 13
Group III:  D2 - 2 4 5;  D6 - 3 4;  D7 - 5 10 11
Group IV:   D3 - 2 4 6;  D14 - 10 11 15

Average Keys per Cell = 6.25

Figure 4-5

Classification Tree, 14 Documents, N = 2, E = 0

If further partitioning were desired, some or all
of the four groups of documents could be used as input to
the next level partitioning. For example, if the maximum
documents per cell was set at three, only Group I would
be repartitioned, the rest becoming terminal nodes, or
cells.

4.2.3  Unusual Situations

Two related situations which are not taken care of
in the basic CLASFY algorithm were encountered in
processing the large files.

The first can occur upon a proper combination of
a relatively large sensitivity factor, relatively small
number of documents (i.e., just above the cell criterion),
and relatively small vocabulary of keywords contained in
these documents. When these conditions allow, some of the
N groups formed are empty. This occurs because, during
Pass 1 processing, the relation

$(n_i + a_i) \le (n_j + a_j) + E$

can hold for all the remaining documents even though one
or more of the $n_j$ (number of keywords in group j) are zero.
When this situation arises, the empty groups are ignored
and are not counted in the total number of cells (even
though they trivially pass the cell criterion).

A more serious situation occurs when this
aforementioned condition is carried to extremes, that is, when
all but one group is empty. This is a rare situation but
is most likely to occur (and has in both large files) when
the documents of a node have few keywords per document
and there is a high degree of overlap between keywords.
The best example of this is a set of identical document
descriptions numbering more than the cell criterion. When
this situation occurs, the classification process is endless,
for each partitioning will result in one group of all the
documents and N - 1 empty groups. The solution arrived at
is, when this is recognized by encountering N - 1 empty
descriptor groups at the end of Pass 1, to arbitrarily
partition the node into N equally sized (if possible)
groups of documents.

4.2.4  Discussion of Parameters

4.2.4.1  Stratification Numbers

The stratification (also called ramification)
number of a node is the number of branches (partitions)
leading out of that node. When searching for an optimum
value for this number, a number of factors must be taken
into account. Among these are:

1)  What is the best number in relation to user
    efficiency of browsing through a hierarchy?

2)  What is the best number in relation to
    minimizing the number of keywords per cell for
    any given number of cells?

3)  How does the size and scope of the collection
    affect the optimum stratification number?

Another point to consider is the advisability of a
fixed stratification number versus a varying one. For
simplicity, in these experiments the stratification
number was selected at the start of each classification
and was not changed. Of course, the lower the
stratification number, the deeper (more levels) the classification
tree will be.

With reference to item 1) above, Prywes, et al.
[103,104] have found that, based on minimizing decision time,
the node stratification number has a broad optimum at e
(2.718...). More recently, Thompson, et al. [138,139]
have taken "window shift time" as well as "decision time"
into account. Thompson defines these terms as:

    "Decision time required for visually orienting
    to an alternative branch, focusing on the word
    or statement describing the alternative, and
    deciding whether or not the alternative is in
    the direction in which to continue the search."

and

    "Window shift time required to shift the viewing
    window to the next level of the tree and visually
    to orient oneself to the new display."

In an on-line interactive system, window shift time
is that time required to perform an indicating function,
such as touching a point on a CRT display with a light
pen, plus the time required for the computer to retrieve
and display a new tree section. It was found that the
optimum stratification number is dependent upon the ratio
of these quantities, but independent of the size of the
data base. For realistic estimates of this ratio, the
optimum node stratification number was found to always
lie in the range 3-5.

Classification experiments were performed on the
small keyword file to answer item 2) above. These
experiments varied the stratification number, N, while keeping
the sensitivity factor, E, constant. It was found that on
going from N = 2 to N = 3, there was a reduction of about
five percent in the number of keys per cell. However,
increasing N beyond 3 did not significantly affect the
number of keys per cell.


Based on the above and the intuitive feeling that
the answer to item 3) is that the stratification number
should be somewhat larger for larger collections, the
stratification numbers used for the major experiments
of this research were chosen to be N = 3 for the small
file and N = 5 for both large files. One possible
justification for increasing N with collection size is that
this slows the increase in classification time. In fact,
if $\log_N N_d$ were kept a constant, the classification time
would be proportional to $N_d$ and hence the time (i.e., cost)
per document would remain a constant for any collection
size. If $\log_N N_d$ were set equal to seven ($\log_3 2254 = 7.03$,
$\log_5 46900 = 6.64$ for the collections studied here), N
would not reach ten until $10^7$ documents were in the
collection.
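
The relation between collection size and stratification number under a fixed tree depth can be checked with a few lines of Python. This is an illustrative calculation only; the function name stratification_for_depth is hypothetical, and the depth of seven is the value suggested in the text.

```python
import math


def stratification_for_depth(num_docs, depth=7.0):
    """N such that log_N(num_docs) equals `depth`, i.e. N = num_docs ** (1 / depth)."""
    return num_docs ** (1.0 / depth)


# The small file: 2254 documents classified with N = 3.
print(round(math.log(2254, 3), 2))

# Stratification numbers that would hold the depth at seven levels.
for num_docs in (2254, 46900, 10 ** 7):
    print(num_docs, round(stratification_for_depth(num_docs), 2))
```

Holding the depth fixed means N grows only as the seventh root of the collection size, which is why N stays below ten until the collection reaches ten million documents.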

4.2.4.2  Sensitivity Factor

The sensitivity factor, E, strongly affects the
quality of the final classification. E is used during Pass
1 to control the relative sizes of the groups of keywords
as they grow. A small E tends to even out the number of
keywords per group, while a large E tends to emphasize
keyword co-occurrence among the document descriptions. For
small numbers of documents, the number of documents per
group is approximately proportional to the number of keys
per group. Therefore, since it is desirable to form
groups of about the same size, for small collections
(or the lower nodes of classifications of large
collections) a relatively small value of E is desirable.
This does not necessarily hold for large collections.
For example, in Fig. 4-2, group 3 has 1085 keys and l'?23
documents, while group 5 has 1157 keys and only 393
documents. A more extreme example is a case (in a
classification of 96,821 documents) where two groups with the same
number of keywords had 2738? and 2367 documents respectively.

The fact that larger values of E emphasize keyword
co-occurrence, and hence yield better classifications, is
illustrated in Figure 4-6. These curves (only parts of
which are shown in the diagram) show that, holding all
other parameters constant, fewer keys per cell (i.e.,
better classification) result from increasing E. However,
the improvement gained by increasing E decreases as E
gets larger. This can be seen in Fig. 4-6 by observing
that the decreases in keys per cell are about equal for
E going from 0 to 10, from 10 to 50 and from 50 to 150.
The effect of increasing E in Pass 1 is felt in later
passes by decreasing the number of redundant descriptions
processed in Pass 3 ("PHASE 2 REDUNDANCY" of Fig. 4-2).
For the examples shown in Fig. 4-6, the sum of the
redundant descriptions processed by Pass 3 of the first three
levels of the hierarchies is 2029 for E = 0 and 1267 for
E = 50.

The net result is that E should be set as high as
possible without greatly unbalancing the size of the
groups and hence, the structure of the hierarchy. With
this in mind (for the experiments reported in Chapter 5),
E was set at 50 for the small file. In order to take
advantage of all the above aspects of E, E was set at
75 - 150 for the top of the classifications of the large
files and varied down to 25 - 40 for the lower nodes of
the classifications.

Figure 4-6 (vertical axis: Average Keys per Cell)

4.2.5  Ordering  of  Input 

The actual categories formed by CLASFY and therefore,
the quality of the classification, depend, to some extent,
on the order in which the documents are processed. It is
desirable to obtain a unique ordering of documents which
optimizes the classification. A number of different
orderings were tried, some unique (independent of the
original order) and some not (e.g., random ordering). One
particular ordering was found to outperform all of the
others for all three files.

Because the orderings used are similar to basic
elements of the other classification algorithms described
in this chapter, they will not be discussed here. The
reporting of the results of the different orderings will
be deferred until Chapter 5.


The classification tree of Figure 4-5 does not
present a hierarchy suitable for browsing. For browsing,
it is desirable for the more general terms to be near the
top of the hierarchy, progressing downwards until the
most specific terms are near the bottom.

The hierarchy of keywords is formed from the
bottom to the top. It should be noted that this keyword
hierarchy is not used as a semantic hierarchy in a
thesaurus in order to obtain descriptors for documents,
but comes about a posteriori. Initially, the terminal
nodes, or cells, are assigned the keywords which result
from the union of the keyword surrogates of the documents
in that cell. The keywords of the N terminal nodes under
a parent (next level up the hierarchy) node are then
intersected and those resulting keywords are assigned to the
parent node. The keyword sets of the original N nodes
are then deleted of the keywords assigned to the parent
node. This process is continued until the top node is
reached. A hierarchy was generated for the example of
Section 4.2.2 and is shown in Figure 4-7. This should be
compared with the classification tree of Figure 4-5.
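
The bottom-up generation just described can be sketched in Python. This is an illustration under stated assumptions rather than the program actually used: the tree is given as nested lists whose leaves are the keyword unions of the four cells of the example, node numbers are written as dotted strings, and the function name assign_keywords is hypothetical.

```python
def assign_keywords(node, number, table):
    """Fill `table` (node number -> keywords) from the bottom of the tree up.

    `node` is either a set of keywords (a cell) or a list of child nodes.
    Returns the keyword set that is passed up to the parent.
    """
    if isinstance(node, (set, frozenset)):
        table[number] = set(node)
        return table[number]
    numbers = [f"{number}.{k}" for k in range(1, len(node) + 1)]
    child_sets = [assign_keywords(c, m, table) for c, m in zip(node, numbers)]
    common = set.intersection(*child_sets)  # keywords shared by all children
    for m in numbers:
        table[m] -= common                  # delete them from the children
    table[number] = common                  # and assign them to the parent
    return common


# Keyword unions of the four cells of the 14-document example.
cell_1 = {1, 2, 3, 5, 8, 9, 14}   # D1, D5, D10, D11, D12, D13
cell_2 = {1, 2, 6, 7, 12, 13}     # D4, D8, D9
cell_3 = {2, 3, 4, 5, 10, 11}     # D2, D6, D7
cell_4 = {2, 4, 6, 10, 11, 15}    # D3, D14

table = {}
assign_keywords([[cell_1, cell_2], [cell_3, cell_4]], "1", table)
for number in sorted(table):
    print(number, sorted(table[number]))
```

For the example, keyword 2 occurs in all four cells and rises to the top node, keyword 1 stops at the parent of the first two cells, and keywords 4, 10 and 11 stop at the parent of the last two.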

Figure 4-7 also indicates the canonical node
numbers for the seven nodes in the hierarchy. This method
of node numbering allows one to immediately determine the
location of a node from its number. For example, cell III
of Fig. 4-7 is node 1.2.1. To find this node one starts
at the top (1), takes the second branch from the left (1.2)
and then the first branch from the left (1.2.1). The
number of digits in a node's number indicates the level
of the node in the hierarchy (here, cell III is on the
third level).

Node 1
    Node 1.1
        CELL I:    D1 - 2 3;  D5 - 3 5 1;  D10 - 8 9;
                   D11 - 9 1;  D12 - 8 1;  D13 - 1 14
        CELL II:   D4 - 2 7 6;  D8 - 7 1;  D9 - 6 12 13
    Node 1.2
        CELL III:  D2 - 2 4 5;  D6 - 3 4;  D7 - 5 10 11
        CELL IV:   D3 - 2 4 6;  D14 - 10 11 15

Figure 4-7

Keyword Hierarchy for Example of Section 4.2.2

It might seem that the more frequently a term is
used, the higher it should be in the hierarchy. In general,
this is true. However, equally important is how a term
is used. A keyword which is high on the frequency rank
by virtue of appearing in almost every document description
of a few specialties would not rise very high in the
hierarchy. On the other hand, a keyword with the same
frequency of occurrence but with broader appeal might
rise close to the top of the hierarchy. Incidentally, there
is nothing in the hierarchy generation algorithm to prevent
the same keyword from appearing at more than one node.
In fact, this occurs for the majority of the keywords.

Two properties of this type of keyword hierarchy
are worth noting. The first is that the keywords
of each document description are wholly contained in the
set of keywords consisting of the keywords at the nodes
in the direct path from the top of the hierarchy to the
terminal node which contains that document. In other words,
referring to Fig. 4-7, one is guaranteed that the keywords
of document D7 all occur in the union of the keywords of
nodes 1, 1.2, and 1.2.1. The second property is that each
keyword will appear at most once in any given path from
the top of the hierarchy to a terminal node. This means
that if a keyword appears at node 1.2, it cannot appear
at nodes 1, 1.2.1, or 1.2.2.
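
Both properties can be checked mechanically for the example. The Python sketch below is illustrative only: the dictionary NODE_TO_KEY restates the node-to-key table of Figure 4-8 with node numbers as dotted strings, and the helper names path_from_top and keys_on_path are hypothetical.

```python
# Node-to-key table of the example (Figure 4-8).
NODE_TO_KEY = {
    "1": {2},
    "1.1": {1},
    "1.1.1": {3, 5, 8, 9, 14},
    "1.1.2": {6, 7, 12, 13},
    "1.2": {4, 10, 11},
    "1.2.1": {3, 5},
    "1.2.2": {6, 15},
}


def path_from_top(node):
    """Canonical numbers of the nodes on the path from the top down to `node`."""
    parts = node.split(".")
    return [".".join(parts[:k]) for k in range(1, len(parts) + 1)]


def keys_on_path(node):
    """Keyword sets found at each node on the path from the top to `node`."""
    return [NODE_TO_KEY[p] for p in path_from_top(node)]


# First property: document D7 (keywords 5, 10, 11) lies in cell III, node 1.2.1.
d7 = {5, 10, 11}
covered = set().union(*keys_on_path("1.2.1"))
print(path_from_top("1.2.1"), d7 <= covered)

# Second property: no keyword repeats along any path to a terminal node.
for cell in ("1.1.1", "1.1.2", "1.2.1", "1.2.2"):
    sets = keys_on_path(cell)
    print(cell, sum(len(s) for s in sets) == len(set().union(*sets)))
```

The length of the list returned by path_from_top is also the level of the node, in line with the canonical numbering.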

It would seem that if a reasonable number of
keywords exist above the lower levels of a hierarchy, it
means that the classification algorithm did not do a very
good job of grouping like documents. To some extent this
is true. That is, the better the classification, the fewer
keywords above the lowest two or three levels of the
hierarchy. For an ideal collection from a classification
viewpoint (i.e., the cells form a mutually exclusive
partition of the entire keyword vocabulary), there would
be no keywords above the cell level. However, in any
real collection there are a sufficient number of keywords
generic to enough segments of the collection to form a
reasonable hierarchy, regardless of how good a
classification one achieves.

Various tables are required for using a hierarchy
of this nature in an IS&R system.

4.3.1  Node-to-Key  Table 

The node-to-key table is of direct use in
browsing. A user might enter a hierarchy at node 1 and
successively decide to proceed to node 1.2 and then 1.2.1.
The retrieval system must be able to quickly retrieve the
keywords appearing at any given node. This is done by
entering the node-to-key table with a node number and
coming out with the corresponding key numbers. Figure
4-8 shows the node-to-key table for the example of this
chapter.

This table is actually the internal representation
of a hierarchy in a computer memory.

NODE       KEY

1          2
1.1        1
1.1.1      3, 5, 8, 9, 14
1.1.2      6, 7, 12, 13
1.2        4, 10, 11
1.2.1      3, 5
1.2.2      6, 15

Figure 4-8

Node-to-Key Table for Example

4.3.2  Key-to-Node Table

The key-to-node table is used for retrievals by
conjunctions and disjunctions of keywords. For example,
consider the key-to-node table of the current example
shown in Figure 4-9. This is actually an inverted file on
nodes as opposed to the usual inverted file on documents.

KEY        NODE

1          1.1
2          1
3          1.1.1, 1.2.1
4          1.2
5          1.1.1, 1.2.1
6          1.1.2, 1.2.2
7          1.1.2
8          1.1.1
9          1.1.1
10         1.2
11         1.2
12         1.1.2
13         1.1.2
14         1.1.1
15         1.2.2

Figure 4-9

Key-to-Node Table for Example

Suppose that the following is a document request
(the keywords have already been converted to integer form):

    4 & (2 v 3 v 7).

After entry into the key-to-node table this is converted
to:

    1.2 & (1 v 1.1.1 v 1.2.1 v 1.1.2)

Because each description must occur in the path between
the top node and a document's cell, and because of the
nature of canonical node numbering, only those conjuncts
which do not disagree in any digits (a missing digit is
not a disagreement) constitute valid search paths. In
this example 1.2 & 1.1.1 and 1.2 & 1.1.2 do not constitute
valid paths as they differ in the second digit (i.e., 1.2
& 1.1.2 imply that keywords 4 and 7 do not appear together
in any document description). On the other hand, 1.2 & 1
results in node 1.2 and 1.2 & 1.2.1 results in node 1.2.1.

The original request could have been to display
these nodes (via the node-to-key table) in order to browse
through the tree without having to start at the top. If
this were the case, nodes 1.2 and 1.2.1 would be displayed.
However, if the request was for the documents themselves
(as it is), a third table must be consulted.
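
The conversion of a request into valid search paths can be sketched in Python. This is a minimal illustration, assuming the key-to-node table of Figure 4-9 held as a dictionary with node numbers as dotted strings. The names conjoin and search_paths are hypothetical, and only requests of the form "one key AND (a disjunction of keys)" are handled.

```python
# Key-to-node table of the example (Figure 4-9).
KEY_TO_NODE = {
    1: ["1.1"], 2: ["1"], 3: ["1.1.1", "1.2.1"], 4: ["1.2"],
    5: ["1.1.1", "1.2.1"], 6: ["1.1.2", "1.2.2"], 7: ["1.1.2"],
    8: ["1.1.1"], 9: ["1.1.1"], 10: ["1.2"], 11: ["1.2"],
    12: ["1.1.2"], 13: ["1.1.2"], 14: ["1.1.1"], 15: ["1.2.2"],
}


def conjoin(a, b):
    """Deeper of two nodes that agree in every digit they share, else None."""
    shorter, longer = sorted((a.split("."), b.split(".")), key=len)
    if longer[:len(shorter)] != shorter:
        return None  # the numbers disagree in some digit: not a valid path
    return ".".join(longer)


def search_paths(required_key, alternative_keys):
    """Valid nodes for the request: required_key & (k1 v k2 v ...)."""
    paths = []
    for a in KEY_TO_NODE[required_key]:
        for key in alternative_keys:
            for b in KEY_TO_NODE[key]:
                node = conjoin(a, b)
                if node is not None and node not in paths:
                    paths.append(node)
    return paths


print(search_paths(4, [2, 3, 7]))  # request 4 & (2 v 3 v 7)
```

A missing digit is not treated as a disagreement, so node 1 combines with node 1.2 to give 1.2, while 1.2 and 1.1.1 are rejected because they differ in the second digit.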

4.3.3  Terminal  Node  Table 

The  terminal  node  table  Is  used  for  converting 
from  a  node  number  to  one  or  more  cell  addresses.  For 
this  example,  the  terminal  node  table  would  look  like» 


Terminal  Node 

.Cell 

1.1.1 

I 

1.1.2 

II 

1.2.1 

III 

1.2.2 

IV 

Upon entering this table with a node number, one retrieves
all cell locations whose terminal node numbers match the
incoming node number through the level of the incoming
node number. For this example, the node numbers in
question are 1.2 and 1.2.1 (see previous section). For
1.2, the terminal node table indicates 1.2.1 and 1.2.2
as terminal nodes and therefore cells III and IV are indicated.
For 1.2.1, cell III is indicated once again. Thus, cell
III is to be searched for the keyword functions (4 ∧ 2) and
(4 ∧ 3) and cell IV for (4 ∧ 2). When this is done (see
Fig. 4-7), documents D2, D3, and D6 are retrieved.
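
The lookup just described amounts to a prefix match on node numbers. The following is a minimal sketch using the four-row table of this example; the name `cells_for_node` is hypothetical and the dictionary simply restates the table above.

```python
# Terminal node table of the example (terminal node -> cell).
TERMINAL_NODE_TO_CELL = {
    "1.1.1": "I",
    "1.1.2": "II",
    "1.2.1": "III",
    "1.2.2": "IV",
}


def cells_for_node(node):
    """Cells whose terminal node matches `node` through the level of `node`."""
    prefix = node.split(".")
    return [
        cell
        for terminal, cell in TERMINAL_NODE_TO_CELL.items()
        if terminal.split(".")[:len(prefix)] == prefix
    ]


print(cells_for_node("1.2"))    # ['III', 'IV']
print(cells_for_node("1.2.1"))  # ['III']
```

Comparing the split components rather than the raw strings avoids a false match between, say, node 1.1 and a node numbered 1.10 in a larger tree.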

It should be noted that in this system, the indication
to search a cell does not guarantee that any document
descriptions exist in that cell which satisfy the
search request. One purpose of classification is to
maximize the probability that if a cell is searched, there
will be documents there which satisfy the original request.

4.4  Forward  and  Reverse  Classifications 

When working on any relatively complex task, one
often wonders: "Isn't there an easier way of doing this?"
With that thought in mind, an attempt was made to solve
the classification problem by sorting the document file
and then partitioning it in a single pass.

The results of this endeavor are two classification
schemes, herein called forward and reverse classification,
which differ only on the sorting order and not on the
partitioning algorithm. In addition, since the resulting
orderings are unique (independent of the original ordering)
they were also used as input to CLASFY (see Section 4.2.5).

The rationale behind sorting as a classification
algorithm is that in a file sorted on keywords, most documents
will have at least one, and many times more than one, keyword
in common with their neighbors. In addition, if documents
are forced together by virtue of having a few particular
keywords the same, the chances are good that other keyword
co-occurrences will exist.

4.4.1 Forward Ordering

For the forward ordering, the keywords of each
document are first sorted (this was done in the actual
experiments using linear selection with exchange [64])
in ascending order. Figure 4-10a shows this for the 14
documents of Figure 4-3. After this has been done to all
the document descriptions, the strings of keywords are
temporarily considered to be individual, variable length
strings of digits. All keywords must be considered to be
of the same length as the longest keyword. For this
example, the string of D7 would be 051011.

The documents are then sorted (in the actual experiments,
IBM system sort routines [65] were used) in ascending
order of keyword strings. This is shown in Figure
4-10b. Thus, the entire file has been ordered by frequency
of the keywords occurring in the document descriptions.
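
The two sorting steps can be sketched as follows. This is an illustrative reconstruction, not the program used in the experiments: the names `forward_key` and `forward_order` are hypothetical, and the three-document file is made up, apart from the description 5 10 11, whose string 051011 is the one quoted above for D7.

```python
def forward_key(keywords, width):
    """Ascending keywords, each zero-padded to `width` digits, as one string."""
    return "".join(f"{k:0{width}d}" for k in sorted(keywords))


def forward_order(docs):
    """Document names sorted in ascending order of their keyword strings.

    `docs` maps a document name to its list of integer keywords.
    """
    width = len(str(max(k for kws in docs.values() for k in kws)))
    return sorted(docs, key=lambda name: forward_key(docs[name], width))


# Hypothetical three-document file (names A, B, C are not from the source).
docs = {"A": [11, 5, 10], "B": [1, 5, 6, 7], "C": [5, 6, 7]}
print(forward_key(docs["A"], 2))  # 051011
print(forward_order(docs))        # ['B', 'C', 'A']
```

Because the keywords are padded to a common width, comparing the strings character by character gives the same result as comparing the keyword lists position by position.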

Figure 4-11 shows the first 45 documents of the
title word file in forward order. The headings on this
figure are: NDOC - order number of document, ABNO - digit
"1" followed by abstract (document) number (00000 - 47055),
SECT - a priori category ("section") number (not used here
- see Section 4.6), NSEL - number of keywords, KEYS -
integer keywords.

It should be noted that this ordering does not always
force "like" documents to be near each other. For
example, consider two documents with descriptions 1 5 6 7
and 5 6 7 respectively. They would not be placed near
each other because in the forward ordering, 5 6 7 would be
placed after all the documents with keywords 1, 2, 3, and
4 while 1 5 6 7 would be placed at or near the beginning
of the ordering.

4.4.2  Reverse  Ordering 

Basically, the reverse ordering is the opposite
of the forward ordering. However, there is one very important
difference. Without this difference, the order of
the keywords in each document description would be switched.
For example, 10 11 15 would become 15 11 10. However, 15
is a unique (i.e., occurs only once in the collection)
keyword. Therefore, because 15 cannot occur in any other
document, it doesn't make sense to use 15 as the highest
order keyword for sorting. Therefore, only the order of
the non-unique keywords is reversed in going from the forward
to the reverse ordering. Now 10 11 15 becomes 11 10
15. The result of this keyword ordering for the example is
shown in Figure 4-12a (here, 12 is the lowest unique keyword).

The sort of the documents on the keyword strings
proceeds as in the forward ordering except that the documents
are now sorted in descending order of keyword strings.
This is shown in Figure 4-12b.
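
A sketch of the reverse ordering is given below. It rests on two assumptions that should be checked against Figure 4-12: that every keyword numbered at or above the lowest unique keyword is itself unique (as in the example, where 12 is the lowest unique keyword), and that unique keywords stay in ascending order behind the reversed non-unique ones. The function names are hypothetical.

```python
def reverse_key(keywords, lowest_unique, width):
    """Non-unique keywords in descending order, then unique ones, zero-padded."""
    ks = sorted(keywords)
    common = [k for k in ks if k < lowest_unique]   # non-unique keywords
    unique = [k for k in ks if k >= lowest_unique]  # assumed unique keywords
    ordered = common[::-1] + unique
    return "".join(f"{k:0{width}d}" for k in ordered)


def reverse_order(docs, lowest_unique):
    """Document names sorted in descending order of their reverse strings."""
    width = len(str(max(k for kws in docs.values() for k in kws)))
    return sorted(
        docs,
        key=lambda name: reverse_key(docs[name], lowest_unique, width),
        reverse=True,
    )


print(reverse_key([10, 11, 15], lowest_unique=12, width=2))  # 111015
```

The printed string corresponds to the description 11 10 15 given in the text for the document with keywords 10 11 15.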

4.4.3  Modified  Orderings 

These orderings can be improved to some extent.
Consider the forward ordering of Fig. 4-10b. The keywords
of document D7 do not appear elsewhere in the vicinity of
that document. This is because the other occurrences of
keyword 3 occurred in descriptions with lower keywords (D2
and D5), and therefore were already accounted for. The
logical location for D7 is next to D14 since both documents
have keyword 10 (and, incidentally, keyword 11).

This situation can occur whenever all but one
occurrence of a keyword appear in documents with lower
(for forward ordering) or higher (but non-unique, for
reverse ordering) keywords. This occurs in documents D6,
D7, D9, D10, and D14 in the forward ordering (Fig. 4-10b)
and documents D12, D6, D1, and D13 in the reverse ordering
(Fig. 4-12b). While keyword co-occurrence cannot always
be improved by modifying the ordering, it was shown by
experimentation that it can in enough instances to make it
worthwhile.


Figure 4-13 presents a flowchart for a program
which modifies the forward ordering. For
the reverse ordering, one only has to change the statements
involving CURRENT in boxes A, B, and C to: CURRENT
= highest keyword (box A), CURRENT = CURRENT - 1 (box B),
and CURRENT = 1? (box C).

Figure 4-14 shows the resulting forward and reverse
orderings for the file of the example after processing
by the modification program. For the forward ordering,
four documents (D6, D7, D9, and D10) were removed from
the input file for later processing (i.e., were written
onto the TEMP or LAST files) and three of these ultimately
wound up in the LAST file (D6, D9, D10) before being
outputted. Note that the LAST file collects all the documents
whose keywords are unlike the first keyword of any
other document in the final ordering (sort of a garbage
collector). The reverse ordering had only two documents
removed from the input file (D12 and D6) and neither of these
wound up in the LAST file. It should be noted that it is
possible for a document to be rewritten many times on the
TEMP file before being outputted or written onto the LAST
file (and later outputted). The numbers of documents removed
from the input for later processing and those written
on the LAST file for the two orderings of each of the
experimental files are shown in Table 4-1. Other statistics
for these files can be found in Section A.4.


Figure 4-13

Order Modification Program (Forward Order)


Figure 4-14

Final (Modified) Orderings


Henceforth, reference to forward and reverse
orderings will imply those orderings after modification.

4.4.4 Classification Algorithm

Probably the simplest method of transforming an
ordering into a classification is, for Nd documents and
Nc cells, to declare every Nd/Nc documents to be a cell.
However, this would not make optimum use of the properties
of an ordering. For example, consider the forward ordering
of Fig. 4-14a. For four cells, Nd/Nc = 3.5. This would
indicate that the first three or four documents should
comprise a cell. However, it is obvious that, because
each contains the keyword "1", the first five documents
should be in the first cell.

An algorithm to allow for cases such as this has
been programmed and is flowcharted in Figure 4-15. Two
parameters are necessary: AVR, the projected average cell
size, and MAX, the maximum allowable cell size.
The actual number of cells resulting from this classification
procedure is not known in advance (the same is true
for CLASFY) but is a function of the values chosen for
AVR, MAX and the contents of the document file itself.

Figure 4-15

Ordered File Classification Algorithm


This algorithm tries to divide the documents into
cells at points where the first keyword changes, but after
AVR documents have been included. If this is not possible
because the number of occurrences of a keyword in the
first position in documents is larger than the value set
for MAX (such as keyword "1" in the forward ordering of a
large file; e.g., 4294 times for the large keyword file -
see Fig. A-2), this process is attempted with the second
keyword. This is continued until a keyword position is
found where the cell division can take place. Since this
would never occur if there are over MAX documents with
identical descriptions, after a number of keyword positions
have been investigated (here, arbitrarily set at seven -
see box B of Fig. 4-15), the next MAX documents are
considered to be a cell.
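
The partitioning rule can be approximated by the sketch below. It is a simplified reading of the prose, not a transcription of the flowchart in Figure 4-15: the names `next_cut` and `classify` are hypothetical, AVR is assumed to be at least 1 and MAX at least AVR, and the special case that produces an occasional cell smaller than AVR (the "no" branch of box C) is not modelled.

```python
def prefix(doc, pos):
    """The first pos+1 keywords of a document description."""
    return tuple(doc[: pos + 1])


def next_cut(docs, start, avr, max_size, max_positions=7):
    """Index at which the cell beginning at `start` should end."""
    n = len(docs)
    if n - start <= avr:
        return n  # the remaining documents form the last cell
    for pos in range(max_positions):
        # First point, at least AVR documents in, where the keywords
        # up to position `pos` change.
        i = start + avr
        while i < n and prefix(docs[i], pos) == prefix(docs[i - 1], pos):
            i += 1
        if i - start <= max_size:
            return i
    # No usable change point was found: take the next MAX documents.
    return min(start + max_size, n)


def classify(docs, avr, max_size, max_positions=7):
    """Split an ordered list of keyword lists into consecutive cells."""
    cells, start = [], 0
    while start < len(docs):
        end = next_cut(docs, start, avr, max_size, max_positions)
        cells.append(docs[start:end])
        start = end
    return cells


# Hypothetical ordered file: five descriptions begin with keyword 1.
ordered = [[1, 2], [1, 3], [1, 4], [1, 5], [1, 6], [2, 3], [2, 4], [3, 5]]
print([len(c) for c in classify(ordered, avr=3, max_size=6)])  # [5, 3]
```

In this made-up file the first cell keeps all five documents that begin with keyword 1, rather than being cut after three or four, which is the behaviour the text argues for.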

Even though an occasional cell will have fewer
than AVR documents (i.e., if there are fewer than AVR
first positional appearances of a keyword but the inclusion
of the next keyword exceeds MAX documents - see the no
branch out of box C in Fig. 4-15), most will have between
AVR and MAX documents. In addition, given the same AVR
and MAX, the average cell in a forward classification will
be larger (hence fewer cells) than the average cell in a
reverse classification because of the longer "runs" of
first position keywords in the forward ordering.

Thus, AVR and MAX must be carefully chosen to obtain
the number of cells desired (Nc) and, at the same
time, optimize the classification. The more cells desired,
the higher AVR must be set above Nd/Nc. It was found that
for the experimental files, MAX should be set at about
four times AVR for best results for the forward classification.
The same factor was used for the reverse classification
but the largest cells formed were about two times
AVR; therefore setting MAX above this value had no effect.

As an example, the following are statistics on
classifying the large keyword file with a goal of about
250 cells (average cell size = 187 documents).


                              Forward    Reverse

Number of documents, Nd         46821      46821

AVR set at                         93        179

MAX set at                        372        895

Nd/AVR = projected
number of cells                   503        262

Actual number of cells            283        247

Largest cell                      369        314

Average documents
per cell                          165        190

Figure 4-16 shows the file of the example used in
this chapter classified into four cells by both the forward
and reverse classifications.

4.5 Random "Classification"

In order to see how each of the classification
algorithms compared to doing nothing at all, a random
"classification" (it is admittedly somewhat of a contradiction
calling a random process a classification) was set
up.
[Table: File Statistics (columns: Small Keyword, Large Keyword, Title Word)]


This was done by assigning a random number [61]
to each document and then sorting the file on this random
number. The result of this is a randomly ordered file.
One might expect something of this nature in a file ordered
by accession number only. The file was then "classified"
by considering every Nd/Nc documents to be a cell. This
results in Nc cells for a file of Nd documents.
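
As an illustration only, the random "classification" can be sketched with Python's standard `random` module. The function name, the fixed seed, and the way a fractional Nd/Nc is rounded at cell boundaries are assumptions; the original experiments used the random number source cited above.

```python
import random


def random_classification(doc_ids, n_cells, seed=0):
    """Tag each document with a random number, sort on it, cut into n_cells."""
    rng = random.Random(seed)  # seeded so the sketch is repeatable
    tagged = [(rng.random(), doc) for doc in doc_ids]
    tagged.sort()
    ordered = [doc for _, doc in tagged]
    nd = len(ordered)
    # Boundaries at multiples of Nd/Nc, rounded down, give exactly Nc cells.
    return [
        ordered[i * nd // n_cells:(i + 1) * nd // n_cells]
        for i in range(n_cells)
    ]


cells = random_classification([f"D{i}" for i in range(1, 15)], n_cells=4)
print([len(c) for c in cells])  # [3, 4, 3, 4]
```

For the 14-document example and four cells, Nd/Nc = 3.5, so the cells alternate between three and four documents.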

Besides using the random classification as a measure
of the other classification systems, the random ordering
described above was used as an additional file ordering
for input to CLASFY.

4.6 Human (A Priori) Classification

The last of the five (CLASFY, forward, reverse,
random, human) classification systems under study here is
a manually-generated, a priori classification system. Each
document in the data files used for these experiments (see
Appendix A) has been manually assigned one of almost 300
categories. Examples of category numbers assigned to
documents can be found as the four digit numbers under the
"SECT" heading in Fig. 4-11.

These categories form a hierarchy of up to five
levels (including the complete file as a level) with the
following stratification numbers:


Node Level     Node Stratification

1              11
2              3-11
3              2-9
4              6-8

Because of the hierarchical nature of this system, documents
in categories whose numbers differ slightly are
supposed to be fairly close in content. This enables one
to transform this classification into one of any number of
categories.

Once again the files were sorted, but this time
on category number. The files were then classified into
≈ Nc cells by dividing the files after every ≈ Nd/Nc
documents, ensuring that the division (if possible) occurs
at the end of a category. It should be noted that this is
the same classification algorithm as that used for the forward
and reverse classifications (see Section 4.4.4) with
the category numbers considered as single keywords (i.e.,
one per document).

This  classification  was  used  to  compare  the  quality 
of  a  posteriori  automatic  classification  systems  to  that 
of  a  manually-generated  a  priori  system. 


CHAPTER  5 

EXPERIMENTS  AND  RESULTS 


5.1  Data  Files 

Appendix A contains complete descriptions of the
data files used for these experiments. The three files
under study are the small keyword file, the large keyword
file, and the title word file. The file statistics discussed
in Section A.4 are repeated for convenience in Table
5-1 along with the retrieval statistics reported in Section
B.2.

The small keyword file was used to set the parameters
for the experiments on the large files. In addition,
it was used in conjunction with the large keyword file in
order to relate the classification results to file size
(the large file has twenty times more documents than the
small file). The most significant experimental results
obtained here involve the large keyword file and the (large)
title word file. These files contain essentially the same
documents (≈ 46900 each out of the same 47055 documents)
but are indexed by independent methods, one manual and one
automatic. The indexing of the same document collection
by two different methods was done in order to determine if
the quality of the various classification schemes is a
function of the type of indexing used. Similar results
from the two large files would therefore indicate that the
classification techniques used are relatively independent
of indexing method. Later in this chapter it will be
shown that this is indeed the case.

5.2 Experimental Measures of Quality of Classification

Various experiments were performed on the data
files by themselves and on the files in conjunction with
retrieval requests (see Appendix B and Table 5-1) in order
to measure the relative quality of the classification
schemes described in the previous chapter. The measures
used are discussed below in this section, while the actual
experiments and results are described in the following
sections.

One objective measure of the quality of classification
systems was discussed in Section 2.1. Documents in
a cell are probably close in subject if there are a large
number of keyword co-occurrences in that cell. Therefore,
the average number of discrete keywords per cell (Nkc),
with an aim towards minimization of this quantity, is one
measure of the relative quality of classification systems.
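
To make the measure concrete, the following sketch computes the average number of discrete keywords per cell for a classification given as a list of cells, each cell being a list of keyword lists. The function name and the two tiny classifications are hypothetical.

```python
def avg_keywords_per_cell(cells):
    """Average number of distinct keywords per cell (Nkc in the text)."""
    distinct_counts = [len(set().union(*cell)) for cell in cells]
    return sum(distinct_counts) / len(cells)


# The same four made-up documents, grouped two ways.
grouped = [[[1, 2], [1, 3]], [[7, 8], [7, 9]]]  # like documents together
mixed = [[[1, 2], [7, 8]], [[1, 3], [7, 9]]]    # like documents split up

print(avg_keywords_per_cell(grouped))  # 3.0
print(avg_keywords_per_cell(mixed))    # 4.0
```

Keeping documents that share keywords in the same cell lowers the count, which is why a smaller value indicates the better classification.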

There is a possibility, however remote, that a
classification system can do fairly well on the above test
and yet produce a poor set of categories from a retrieval
efficiency point of view (i.e., does not sufficiently reduce
the number of memory accesses per search request -
see Section 2.2.5). This might come about if a classification
system happened to do a good job of bringing together
documents with keywords which are generally not used
in search requests, but a poor job of doing so with frequently
(from a search request viewpoint) used keywords.
Admittedly, it does not seem likely that a system with a
very low average key per cell count could be very inefficient
in actual memory accesses, but two systems of
equal Nkc might differ in retrieval efficiency. In addition,
the number of memory accesses required for a classified
file can be compared to those required for an inverted
file system. In this way, one can obtain a measure of the
retrieval time savings offered by automatic classification.

In order to test this, actual search requests can
be applied to all classifications in question. The number
of cells which must be searched per question is a measure
of the number of memory accesses required per question.
Naturally, the goal is to minimize this quantity. The number
of retrieval requests used must be large enough for the
results of this test to be significant.

Another measure is the number of documents searched
per request. While it is true that this quantity should be
dependent upon the number of cells searched, because of the
variability in the size of cells additional insight can be
obtained by measuring this quantity. If the number of cells
searched per request is identical for two classification
systems, then the better system is probably the one which
calls for searching fewer documents per request.
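
Both counts can be illustrated with a small sketch. It assumes a purely conjunctive request and assumes that a cell is indicated for searching whenever every requested keyword occurs somewhere in the cell; the actual system reaches cells through the key-to-node and terminal node tables, so this is a stand-in for that machinery, and the function name and data are hypothetical.

```python
def search_cost(cells, request):
    """Return (cells searched, documents searched) for one conjunctive request."""
    wanted = set(request)
    cells_searched = 0
    docs_searched = 0
    for cell in cells:
        cell_keywords = set().union(*cell)
        if wanted <= cell_keywords:      # every requested keyword occurs in the cell
            cells_searched += 1
            docs_searched += len(cell)   # the whole cell is read and scanned
    return cells_searched, docs_searched


cells = [[[1, 2], [1, 3]], [[2, 3], [4, 5], [1, 4]]]
print(search_cost(cells, [1, 3]))  # (2, 5)
print(search_cost(cells, [4, 5]))  # (1, 3)
```

In the first request the second cell is searched even though no single document in it holds both keywords, which reflects the earlier remark that an indication to search a cell does not guarantee a hit.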

It is more difficult to quantitatively measure the
quality of a hierarchy than it is to measure the quality of
the classification itself. One measure of limited value
is the number of entries in the key-to-node table (which
is the same as the number of entries in the node-to-key
table). A smaller key-to-node table indicates that more
keywords have migrated upwards from the cells towards the
apex of the tree, thereby producing a hierarchy richer in
keywords, and requiring less storage space. Therefore,
given two hierarchical systems with all else being equal,
the one with the smaller key-to-node table is probably
better.

The best means of measuring the quality of a hierarchy
is probably large-scale testing of its usefulness
for browsing (not done here). Short of this, however, the
alternative unfortunately is a subjective measure. This
is to look at the keywords at the various nodes and decide
whether they represent a distinct part of the collection
or are merely a jumble of unrelated keywords. In a good
hierarchy, one should be able to consider the keywords at
a node as an abstract of the knowledge contained beneath
that node in the tree.

As stated in the beginning of this section, all
of these measures were used in the following experiments.


Each file was classified numerous times by the
various classification algorithms described in the previous
chapter. All three files were classified using the following
classification schemes: human, forward, reverse,
random, and CLASFY with the file in the reverse order. In
addition, both keyword files were classified by CLASFY
with the files in forward and random orders.

The results of these classifications are shown in
Figures 5-1, 5-2, and 5-3 for the small keyword, large
keyword, and title word files, respectively. The results
are presented in the form described in Section 2.1. That
section can be referred to for the significance of the
parallelogram envelopes and the various numbers on the
axes.

Each classification of each file is plotted from
one cell to Nd cells (i.e., one document per cell). The
probable ranges of interest for the small and large files
are as follows:

File Size   Documents   Cell Range     Documents per Cell
                        of Interest

Small       2254        30 - 125       75 - 18

Large       ≈46900      200 - 1500     235 - 31

Because of the difficulty of displaying so many
curves on single sheets of paper, the actual classification
points are not shown. However, enough classifications were
made to ensure smooth curves. The greatest number of
cells produced for the large files was 2500 for all
classifications but CLASFY. The most cells produced by
CLASFY (large keyword file, reverse ordering) was 1284,
requiring 5 hours and 20 minutes on an IBM 7040 computer.

This time could be greatly reduced by the execution of
CLASFY on a more modern computer (the other classifications
were performed on an IBM 360/65) with faster cycle time and,
of much greater importance, faster and larger capacity
secondary storage facilities. However, since a file is
not classified very often, as long as the classification
time stays within reason (see Section 3.2) the actual
processing time is not very significant.

The curves for CLASFY with the files in random
order (keyword files only) were not drawn on Figures 5-1
and 5-2 because they lie completely between the curves of
CLASFY with the files in forward and reverse orders.

The following are some observations and conclusions
based upon Figures 5-1, 5-2, and 5-3:

1)  The results for the three files are very similar.
This implies that, based on the keyword files,
the relative quality of the classification
systems studied is independent of the size of
the collection. In addition, based on the two
large files, the implication is that the relative
quality of the classification systems
is also independent of how a collection is
indexed. This leads one to the conclusion
that, assuming the collection used here is
representative of other technical collections,
the order of quality of these systems (with
some minor exceptions) is absolute and hence
independent of the collection to be classified.

2)  As  expected,  all  the  "legitimate"  classification 
systems  outperform  the  random  classification 

by  a  considerable  margin. 

3)  CLASFY is the best (based on the number of keys
per cell) classification system studied. In
particular, for all three files and for any
number of cells, CLASFY outperforms the human,
a priori system.

4)  CLASFY outperforms the other systems regardless
of the order of the file presented to the CLASFY
algorithm. However, it performs best (by a
small margin which decreases with increasing
file size) when the input file is in reverse
order.

5)  The forward classification does poorly on few
cells but improves and overtakes the human
classification as the number of cells increases.
This crossover in quality takes place in or just
above (i.e., more cells than) the region of
interest.

6)  The  reverse  classification  starts  off  very 
well  (few  cells)  but  is  not  as  good  as  the 
forward  classification  during  most  of  and  be¬ 
yond  the  region  of  interest. 

7)  Even though CLASFY represents the best system
studied here, its curves of Nkc vs. Nc are above
the diagonals of the parallelogram. In Section
2.1 it was stated that these lines represent
the approximate regions of the expected
plots of Nkc vs. Nc. That statement was made
based on these results and the realization that
while CLASFY might be a good classification
system, future study and experimentation will
probably turn up better classification algorithms.
In addition, just how close this plot
is to the diagonal line is somewhat dependent
upon the document collection as well as the
classification system.

5.4 Results of Retrieval Requests

One hundred sixty-five actual retrieval requests
were used to interrogate the files. See Appendix B for
details on these requests. Table 5-1 shows the number
of documents retrieved in each file as a result of these
requests.

The requests were submitted to each classified file
individually as if they were part of an on-line system.
That is, batching of requests was not allowed. The number
of cells searched and number of documents searched were
recorded for each request and then totaled for the 165
requests. It should be noted that browsing was not used
on the files classified by CLASFY or humans (the only
hierarchic systems, and therefore the only ones which allow
for browsing) to reduce the number of cells searched.
However, browsing would probably be an integral part of
any actual on-line system using CLASFY.

5.4.1  Theoretical  Results  with  Inverted  File 

No experiments were performed with any of the files
organized as an inverted file. However, because the documents
of an inverted file are essentially in random order,
it is possible to obtain some theoretical results.

The number of cells searched is a measure of the
retrieval efficiency if a cell is equivalent to an appropriate
unit of memory (see Section 2.2.5). Considering this
to be the case, one can consider an inverted file as being
divided linearly into physical cells. One can now calculate
the average number of cells which must be entered per
request for an inverted file.

The number and size of the cells will be considered
to be the same as those of the classified files so that
the results will be directly comparable. However, there are
two somewhat offsetting items which should be kept in
mind. An inverted file is usually set up such that either
of the following is true.

1)  The records of the file are not blocked. This
means that in response to a request, only the
desired document need be transmitted into
high-speed storage. Therefore, the input/output
time required for this method, per memory
access, is lower than that of a classified
file. However, this organization requires
more storage space than (2) below and hence
more physical cells are required to contain
the file. This results in more memory accesses
per request.

2)  The records in the file are blocked. This
requires fewer cells than the above and hence
fewer memory accesses; however, because a
significant portion of a cell must be read
into high-speed storage, the input/output
time per memory access is not much less than
that for a classified file.

In either of the above cases, the number of documents
searched for an inverted file is equal to the number of
documents retrieved.

The problem, therefore, is: given X documents
(i.e., X documents will be retrieved) randomly distributed
over Nc cells, what is the expected value for the number
of non-empty cells? A constraint on the problem is that
there is a maximum number of documents, C, which can be
placed into each cell.

Consider a single cell. The probability of a
particular document being placed in that cell, assuming
completely random placement of documents, is 1/Nc. The
probability that i documents out of X are placed in that
cell is therefore

$$\binom{X}{i}\left(\frac{1}{N_c}\right)^{i}\left(1-\frac{1}{N_c}\right)^{X-i}.$$

Since there are Nc cells, the expected (i.e., average)
number of cells with i documents is

$$N_c\binom{X}{i}\left(\frac{1}{N_c}\right)^{i}\left(1-\frac{1}{N_c}\right)^{X-i}.$$

The number of cells with at least one but not more than
C documents is what is desired. Hence, summing from i = 1
to c results in (up to now C has been ignored):

$$N_c\sum_{i=1}^{c}\binom{X}{i}\left(\frac{1}{N_c}\right)^{i}\left(1-\frac{1}{N_c}\right)^{X-i}$$

cells, where c = min(C, X).

It is noted that the above summation represents
part of the cumulative binomial probability function. For
cases when C < X (i.e., c = C), Nc is large and hence 1/Nc
is small. Looking up values for

$$\sum_{i=c}^{X}\binom{X}{i}p^{i}(1-p)^{X-i}$$

under the condition of small values for p [188], one finds
that this quantity is insignificant. Therefore, the
expected number of non-empty cells is approximately

$$N_c\sum_{i=1}^{X}\binom{X}{i}\left(\frac{1}{N_c}\right)^{i}\left(1-\frac{1}{N_c}\right)^{X-i}$$

cells. However, since by the binomial formula, for p < 1,

$$\sum_{i=0}^{X}\binom{X}{i}p^{i}(1-p)^{X-i} = 1,$$

the number of non-empty cells is equal to

$$N_c\left[1-\binom{X}{0}\left(\frac{1}{N_c}\right)^{0}\left(1-\frac{1}{N_c}\right)^{X}\right] = N_c\left[1-\left(1-\frac{1}{N_c}\right)^{X}\right].$$


This expression represents the number of cells
which must be accessed for a request of X documents on a
file divided into Nc cells. The above expression was
evaluated for the parameters of the files and retrieval
requests under study and is presented in graphical form,
along with the classification results, in the next section.
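
The closed form above is easy to evaluate and to check numerically. The following is a minimal Python sketch, not part of the original experiments: the function names, the sample values of X and Nc, and the Monte Carlo check are illustrative assumptions. Like the final formula, it ignores the cell capacity C.

```python
import random


def expected_cells_accessed(x: int, n_cells: int) -> float:
    """Expected number of non-empty cells when x documents fall at random
    into n_cells cells: Nc * [1 - (1 - 1/Nc)^X]."""
    return n_cells * (1.0 - (1.0 - 1.0 / n_cells) ** x)


def simulate_cells_accessed(x: int, n_cells: int, trials: int = 2000, seed: int = 0) -> float:
    """Monte Carlo estimate of the same quantity (illustrative check only)."""
    rng = random.Random(seed)
    total = 0
    for _ in range(trials):
        # Each retrieved document lands in a uniformly random cell.
        total += len({rng.randrange(n_cells) for _ in range(x)})
    return total / trials


if __name__ == "__main__":
    # Hypothetical request sizes and cell counts, chosen only for illustration.
    for x, n_cells in [(10, 100), (50, 250), (200, 1000)]:
        exact = expected_cells_accessed(x, n_cells)
        approx = simulate_cells_accessed(x, n_cells)
        print(f"X={x:4d}  Nc={n_cells:5d}  formula={exact:8.2f}  simulated={approx:8.2f}")
```

The formula always gives fewer cells than documents retrieved, since several retrieved documents may share a cell.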

5.4.2

The number of cells searched when the 165 requests
were applied to the various classifications of the small
keyword file is shown in Figure 5-4. The results shown in
that figure are not very encouraging because not only does
the inverted file system cause fewer cells to be searched
(accessed) than any of the other systems, but all the
classification systems require more cells to be searched in
the range of interest (i.e., 30 - 125 cells) than the number
of documents retrieved! This is due to cells containing
keywords which match a request, but not having any documents
in them which have the proper keywords to satisfy
the request.

Figure 5-4

This is the problem with insufficient numbers of
documents referred to in Section 3.2. By the time the
number of documents has risen to almost 50,000 (large
keyword file), the situation has reversed itself. Figure
5-5 shows the cells searched vs. number of cells plots for
the large keyword file. Here, the CLASFY (any ordering),
human, and forward classifications surpass the inverted
file. The same holds true for the title word file (Figure
5-6).

For the cells searched on the large keyword file,
CLASFY with the input file in reverse order is best,
barely edging out other input orderings to CLASFY and the
human classification, with no others being close in the
range of interest (200 - 1500 cells). In the case of the
title word file, the CLASFY, human, and forward classification
systems are all fairly close in the range of interest;
in fact, their plots cross over between 450 and 600 cells.

It should be noted that for both large files, the number
of cells searched for the reverse and random classifications
still exceeds that searched for the inverted file and also
exceeds the number of documents retrieved.


Figure 5-5

Cells Searched, Large Keyword File


Figure 5-6

Cells Searched, Title Word File

A point made in Section 5.2 is illustrated in the
graphs of the two measures discussed thus far for the title
word file (Figures 5-3 and 5-6). On the basis of keys per
cell, CLASFY is better than the forward classification for
any number of cells. However, on the basis of cells
searched, the reverse holds for more than 450 cells. This,
and other examples which can be found in these two sets of
graphs, show that the different classification systems do
indeed emphasize the co-occurrence of different keywords.

In the range of interest (200 - 1500 cells), the
inverted file requires 19 to ?8 percent more cell searches,
and hence memory accesses, than does CLASFY for the large
keyword file and 18 to 44 percent more than CLASFY for the
title word file. In an on-line, large-scale system the
advantages of CLASFY should increase, perhaps drastically,
for two reasons:

1)  In going from the small to large keyword files
the number of cells searched using CLASFY decreased
tremendously relative to those searched
for the inverted file. This trend can be expected
to continue for larger files.

2)  The number of cells searched shown in these
figures does not take into account the browsing
capabilities of CLASFY. The procedures followed
for browsing in a hierarchy are described
in Section 2.2.3. During the course of request
refinement allowed by browsing, certain sections
of the tree would probably be eliminated because
of lack of relevance to the retrieval
request. This would reduce the number of
cells (and documents) searched without appreciably
decreasing the number of pertinent documents
retrieved. This cannot be done in a
non-hierarchic system, such as an inverted
file. The magnitude of the advantage described
above is potentially quite large;
however, quantitative results are not available
because appropriate experiments have not as yet
been performed.

5.4.3  Documents  Searched 

The number of documents searched for a request is
not completely dependent upon the number of cells searched.
Due to the variations in the cell sizes, a different number
of documents may be searched for the same number of cells
searched.

Documents Searched, Large Keyword File

Documents Searched, Title Word File

This effect can be seen in the plots of the numbers
of documents searched vs. number of cells shown in Figures
5-7, 5-8, and 5-9. It is noted that fewer documents are
searched when the file is organized by the human classification
than by any other system even though CLASFY outperforms
the human classification (in most cases) in number
of cells searched (Figures 5-4, 5-5, and 5-6). Part of
the explanation of this phenomenon lies with the fact
that the cells of the human classification are more even
with respect to numbers of documents than those produced
by CLASFY. Because a cell with more documents is more
likely to be accessed than one with fewer documents, the
size of the average cell searched in a system using CLASFY
would be larger than that in a system using the human
classification. Therefore, for the same number of cells
searched, one would expect somewhat more documents searched
with CLASFY than with the human system. The remainder of
this effect (the above does not account for all of it) is
attributed to the different characteristics of the classification
systems.
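
The cell-size argument above can be illustrated numerically. The sketch below is a hedged illustration only: it assumes that the probability of a cell being accessed is proportional to the number of documents it holds, which the text suggests but does not state, and the cell sizes are invented for the example.

```python
def mean_cell_size(sizes):
    """Plain average number of documents per cell."""
    return sum(sizes) / len(sizes)


def mean_accessed_cell_size(sizes):
    """Average size of an accessed cell, ASSUMING access probability is
    proportional to cell size (size-biased sampling)."""
    total = sum(sizes)
    return sum(s * s for s in sizes) / total


# Two hypothetical classifications of the same 400 documents into 4 cells.
even_cells = [100, 100, 100, 100]    # evenly sized cells
uneven_cells = [10, 40, 100, 250]    # unevenly sized cells

print(mean_cell_size(even_cells), mean_accessed_cell_size(even_cells))
print(mean_cell_size(uneven_cells), mean_accessed_cell_size(uneven_cells))
```

Both layouts average 100 documents per cell, but under the stated assumption the uneven layout leads to a larger average accessed cell, and hence more documents searched for the same number of cells searched.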

The above effect can be seen in columns (a) and (b)
of Table 5-2. It is noted that the percent documents
searched is always greater than the percent cells searched.
The numbers of cells shown represent the low end, logarithmic
center, and high end of the range of interest.

Columns (c) and (d) of Table 5-2 show the percent
and number of documents searched per document retrieved.
Fortunately, the percent of documents searched per retrieval
decreases (in the corresponding cell ranges of interest)
with increasing file size. This results in the number of
documents searched per document retrieved remaining essentially
constant (see column (d), Table 5-2) with respect to
file size. This is very significant, for if it holds for
larger collections, it means that regardless of collection
size, the same number of documents must be transmitted into
main storage per document retrieved. Because the number
of documents per cell would tend to be the same or greater
for larger collections, this implies that the number of
cells searched (and hence, number of memory accesses) per
document retrieved would remain the same or be fewer for
larger collections.


5.5.1  Size of the Key-to-Node Table

The size of the key-to-node table (or equivalently,
the node-to-key table) reflects, to some extent, the quality
of a hierarchical classification system. For the same
number of keys per cell, a smaller key-to-node table means
that more keywords have migrated upwards in the tree,
thereby producing a fuller hierarchy. In addition, a
smaller key-to-node table means less storage space is
required to store the table.

Because CLASFY is the only automatically generated
hierarchical system being studied here, the hierarchies
produced by CLASFY cannot be compared with those produced
by other systems.

Figures 5-10, 5-11, and 5-12 show the size of the
key-to-node tables for CLASFY with and without hierarchy
generation. Of course, the size of a key-to-node table
without a hierarchy (i.e., the only nodes are cells) is
equal to the total number of keywords in the cells, or
Nkc x Nc. The curves for random and human classifications
(no hierarchies) are shown for comparison.

From  these  figures,  it  can  be  seen  that  forming 
a  hierarchy  for  a  collection  classified  by  CLASFY  yields 
about  a  ten  percent  reduction  in  the  size  of  the  key-to- 
node  table. 
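
A toy calculation shows how upward migration of keywords shrinks the table. The sketch below is not the CLASFY procedure: it assumes, purely for illustration, a two-level tree in which a keyword found in every cell under a node is stored once at that node instead of once per cell. The node labels and keywords are hypothetical.

```python
def table_size_flat(groups):
    """Key-to-node entries when the only nodes are cells:
    every keyword of every cell is one entry."""
    return sum(len(cell) for cells in groups.values() for cell in cells)


def table_size_hierarchical(groups):
    """Entries when keywords shared by ALL cells under a node are stored
    once at the node (an assumed migration rule, for illustration)."""
    size = 0
    for cells in groups.values():
        shared = set.intersection(*cells)             # keywords that migrate upward
        size += len(shared)                           # stored once at the parent node
        size += sum(len(cell - shared) for cell in cells)
    return size


# Hypothetical parent nodes, each with two cells of keywords.
groups = {
    "1.1": [{"mass", "spin", "decay"}, {"mass", "spin", "parity"}],
    "1.2": [{"barley", "mutations"}, {"barley", "radiation"}],
}

print(table_size_flat(groups))           # entries without a hierarchy
print(table_size_hierarchical(groups))   # entries with shared keywords moved up
```

In this toy case the table drops from 10 entries to 7; the roughly ten percent reduction reported above is an empirical result for the actual files and is not reproduced by this example.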

5.5.2  Subjective Evaluation and Example

The evaluation of the quality of a hierarchy with
respect to browsing is necessarily subjective. The
best way of doing this is to allow a number of users in
the field of the collection to utilize the system (on-line)
for a reasonable period of time and then present their
opinions on the utility of the hierarchy for browsing
purposes. However, since the retrieval system has not
yet been implemented for on-line browsing, experiments of
this type have not been performed.

Instead, subjective evaluations of the hierarchies
obtained were made by the author and some associates.
Attempts were made to extract a unifying title out of the
keywords which appear at each node. After that was done,
each hierarchy was examined for its ability to separate subject
areas and to see if the node titles indicate increasing
specialization as one proceeds down a path in the tree.

Size of Key-to-Node Table, Small Keyword File

Size of Key-to-Node Table, Large Keyword File

Size of Key-to-Node Table, Title Word File

This was not very successful for the small keyword
file.  It  was  found  that  there  were  too  few  (225*0  documents 
in  the  collection  to  provide  reasonable  subject  separation 
at the various nodes in the hierarchy. On the other hand,
the hierarchies generated from the output of CLASFY on the
large files seem quite acceptable for Nc > 100.

The following example of portions of a hierarchy
is taken from a classification of the large keyword file
in reverse order by CLASFY. Appendix C presents the complete
hierarchy for a similar classification performed
by CLASFY. For this classification, the node stratification
number, N, was set at 5, the sensitivity factor,
S, was varied from 150 at the top of the hierarchy to 25
at the bottom, and the cell criterion, C, was set equal
to a maximum of 460 documents per cell. As a result of
the classification, 249 cells were produced. The average
number of documents per cell is 46821/249 or 188 documents
per cell. Including the apex and cells as levels, the
resulting hierarchy tree varies from three to seven levels
(if it were a balanced tree, it would have had four to five
levels). The number of nodes at each level for the actual
tree and a balanced tree is shown below. The numbers in
parentheses represent the number of terminal nodes, or
cells, produced at each level.


               Actual Tree             Balanced Tree

Level      Total Nodes   Cells      Total Nodes   Cells

  1             1                        1
  2             5                        5
  3            25         (4)           25
  4           105        (86)          125        (94)
  5            95        (82)          155       (155)
  6            65        (62)            -
  7            15        (15)            -

Totals        311       (249)          311       (249)
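
The balanced-tree column of the table follows from simple arithmetic on the stratification number and the number of cells. The sketch below is one way to reproduce it, assuming (as an interpretation, not a statement from the text) that a balanced tree fills every level completely and splits only as many nodes on the deepest full level as are needed to reach the required number of cells. The function name is hypothetical.

```python
def balanced_tree_levels(n_cells: int, branching: int):
    """Return (level, total_nodes, cells) rows for a balanced tree with
    n_cells terminal nodes and the given branching (stratification) number."""
    depth = 0
    while branching ** (depth + 1) <= n_cells:
        depth += 1
    full = branching ** depth    # nodes on the deepest completely filled level
    # Splitting one node replaces 1 cell by `branching` cells, so we need
    # (full - split) + branching * split >= n_cells.
    split = -(-(n_cells - full) // (branching - 1))    # ceiling division
    rows = [(level + 1, branching ** level, 0) for level in range(depth)]
    rows.append((depth + 1, full, full - split))
    if split:
        lower = n_cells - (full - split)    # cells on the last level
        rows.append((depth + 2, lower, lower))
    return rows


for level, nodes, cells in balanced_tree_levels(249, 5):
    print(level, nodes, f"({cells})" if cells else "")
```

For 249 cells and N = 5 this gives 125 nodes at level 4, of which 31 are split, leaving 94 cells at level 4 and 155 at level 5, for 311 nodes in all.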

Despite a few nodes with few or no keywords (e.g.,
nodes 1 and 1.2 have no keywords), in most cases it was not
too difficult to summarize the keywords at a node. Figures
5-13, 5-14, and 5-15 show lists of some of the keywords at
various nodes along with the node numbers and the manually
formed titles for the nodes (the titles are just under the
node numbers). The sample nodes were chosen to illustrate
the hierarchical nature of a tree. For example (see Fig.
5-13), node 1.5.1, Organic Chemistry, is under node 1.5,
Chemistry. Also node 1.2.3.2, Fission Products, is under
node 1.2.3, Nuclear Explosions. All the nodes of Fig. 5-14
are in the same region of the tree (under, but not necessarily
directly under, node 1.1.1) and hence are relatively close
in subject content. Figure 5-15 shows nodes at three levels
of the bottom portion of part of the hierarchy. Nodes


Sample Nodes of Hierarchy

Node 1.2.2.2.1.2.2 (cell)  Node 1.2.2.2.1.2.3 (cell)


PARTICLE  MODELS 


RESONANCES  AND  STRANGENESS  PARTICLE  THEORIES 


particle  models 
elementary  particles 
mass 

scattering  amplitude 

vectors 

protons 

production 


Node 1.2.2.2.1.1

THEORETICAL  PHYSICS 

quantum  field  theory 
relativity  theory 
field  theory 
group  theory 
tensors 

invariance  principle 

sum  rules 

parity 

fermions 

SU  group 

photons 

decay 

spin 

cross  sections 


N*  resonances 
XI  resonances 
F  resonances 
X* resonances
Y*  resonances 
strangeness 
strange  particles 
transients 
decay 

energy  levels 

phase  shift 

kaons-neutral 

pions-plus 

pions-minus 

kaons 

kaons-plus 

hadrons 

omega  particles 
omega-minus
antinucleons
hyperfine structure
tensors 
bound  state 
Schrodinger  equation 
branching  ratio 
time-space
singularity
weak interaction
magnetic  fields 
KEV  range 

differential  equations 
magnetic  moments 
nuclear  theory 
monte  carlo  method 


Node 1.2.2.2.1.2
STRONGLY  INTERACTING  PARTICLES 


baryons 

mesons 

hyperons

quarks

isospin

matrices

pions

electric  charges 

elastic  scattering 

mathematics 

angular  distribution 

energy 

spectra 


bootstrap  model 
current  algebra 
field  theory 
group  theory 
SU-2  group 
SU-3  group 
SU  group 
SU-12 group
O group
O-3 group
S-matrix
S-wave
strangeness 
Legendre  functions 
Feynman  diagram 
conservation laws
hyperfine structure
Mossbauer  effect 
parity 

Inelastic  scattering 
coupling  constants 
Regge  poles 
dispersion  relation 
sum  rules 
statistics 
transients 
phase  shift 
G-parity 
selection  rules 
Hamiltonian  operator 
pair  production 
integrals 
orbits 

annihilation 

spin 

equation 

lattices 

cross-sections 

excitation

crystals

magnetism

neutrons 

kaons 

kaons-minus

pions-minus

fermions 

bosons 

antiprotons 

phonons 

antinucleons 

omega particle

omega-minus

antihyperons 

GEV  range 


Figure  5-15 

Sample Nodes of Hierarchy III

1.2.2.2.1.2.2 and 1.2.2.2.1.2.3 are terminal nodes, or
cells.

The entire hierarchy is too extensive to be shown
here. Therefore, two sections of it have been selected and
are displayed in Figures 5-16 and 5-17. For the sake of
clarity, only the node numbers and manually generated node
titles are shown. The tree segment of Figure 5-16 contains
the nodes shown in Figure 5-14 and that of Figure 5-17
contains the nodes shown in Figure 5-15. In the tree segments,
a dashed line represents the path to another node
and a "c" represents the path to a cell. Not all the nodes of
the tree segments shown are labelled down to the cell
level because (a) not all the labels could fit in the
diagram and (b) not all the nodes were inspected for title
assignment because of the large numbers of keywords and
nodes (311) involved in this manual process.

The quality of a hierarchy should be measured by
its convenience to the user, and not by how closely it
matches an a priori, manually produced one. Therefore,
this hierarchy will not be compared to the one of the
human classification described in Section 4.6. The final
evaluation of hierarchies such as this will have to await
on-line tests of the type described in the beginning of
this section.


5.5.3  Documents in Cells

Most of the documents of each cell are very close in
subject area. For example, consider the document descriptions
shown in Figure 5-18. All of these documents
were placed into terminal node 1.1.1.1.1.1, the leftmost
cell shown under node 1.1.1.1.1, Radiation Effects and
Protection, of Figure 5-16. The keyword numbers have
been converted back into the English keywords. The documents
shown in Figure 5-18 represent 10 out of 391 documents
in that cell. The documents in the cell contain
2822 keyword occurrences, but only 216 distinct keywords
(the ten documents of the example have 73 keyword occurrences
and 20 distinct keywords).

It is interesting to compare this grouping of documents
by CLASFY with that of the NSA classification system
(from which the "human" system was derived). In Nuclear
Science Abstracts, 7 out of the 10 documents were classified
under "Radiation Effects on Plants". Documents *?89
and 34631 were classified under "Genetics and Cytogenetics"
while 41382 can be found under "Ecology". It should be
noted that documents 22785 and 34631, while placed in
different NSA categories, agree in 7 out of 8 keywords
and are both concerned with mutations of barley. In fact,
document 41332, which was placed in a third category, also
has many keywords in common with these two documents and
discusses the same subject.

Figure 5-18

Portion of Terminal Node 1.1.1

One should not conclude from the above that one
set of categories for these documents is better than the
other, but rather that there is more than one "reasonable"
way of classifying documents and that a manual, a priori
system does not necessarily categorize documents better
than an automatic system.

Naturally, with 391 documents, the subject scope
of this cell is much broader than that described by these
10 documents, and, in fact, contains documents on radiation
effects on man as well as on plants. The two fields would
probably be split apart if more than 249 cells were desired.

Unfortunately, automatically derived categories are
prone to grossly misclassifying a number of documents or
groups of documents. While manual systems are not exempt
from errors, gross errors are rare. An example of this is
a few documents on measurement and detection of cosmic
radiation which were placed into the cell under discussion.
Evidently they were placed into this cell via documents
concerned with the effects of cosmic radiation. However,
it is obvious that a better section of the hierarchy for
these documents would be in a cell under node 1.2.2.1.1,
Cosmic Radiation and Detection, shown in Figure 5-17. It
is believed that the number of such misclassifications can
be greatly reduced by modifying pass 3 of the CLASFY algorithm
to place documents with redundant descriptions
logically into a particular group instead of arbitrarily
selecting a group as is sometimes done at present (see
Section 4.2.1).

5.6  Summary  of  Results 

The experiments described in this chapter led to
consistent results which enable one to rank the quality
(with respect to machine retrieval) of the classification
systems studied.

As expected, all other systems outperformed the
random classification. Next in increasing order of quality
come the reverse and then the forward classifications.
Unfortunately, these relatively simple classification
schemes are not nearly as good as more sophisticated techniques,
such as that embodied by CLASFY.

The results using CLASFY are uniformly good regardless
of input ordering; however, CLASFY performs best with
the input file in reverse order. It was found that the differences
in results caused by the order (based on three
orderings: forward, reverse, and random) in which the
documents are presented to CLASFY decrease as the size of
the collection increases. With respect to retrieval efficiency,
CLASFY and the human, a priori system are very close
in quality. CLASFY is slightly better in the number of
cells searched, while the human classification is slightly
better with respect to the number of documents searched.
This means that, because cell access time is usually
longer than incremental document transmission and search
time, if one system is to be ranked above the other, the
edge would have to be given to CLASFY.

The hierarchies produced by CLASFY were not compared
with those produced by any other system. However,
subjectively they seem to be quite "reasonable" and could
be very useful for on-line browsing. The placement of
documents into particular cells (categories), while not always
in agreement with manually derived document placements
(agreement is not necessarily desirable), in most,
but not all, cases is quite satisfactory.

For comparison purposes, the entire hierarchy of a
classification similar to the one discussed here is included
as Appendix C of this dissertation.

5.7  Bonus Result - Average Length of Search in Serial Files

A question relating to serial files has come to the
attention of the author and, while not directly related to
automatic classification, can easily be answered as a result
of some of the experiments performed here.

Fossum and Kaskey [46], among others [36,145],
have proposed organizing serial files via some form of
keyword ordering in order to avoid having to search all
the documents in the file for each request. For example,
consider the following documents ordered by increasing
keyword numbers:




Document     Keywords

   C         1 2 3 4
   F         1 2 5
   A         1 4 ?
   B         2 3 6
   D         2 5 ?
   E         3 6


If a request of keywords 1 AND 3 is entered, the serial
search may stop after three documents (i.e., after document
A), for one is guaranteed that "1" does not appear
further on in the file.
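
The early-termination idea in the example can be sketched in a few lines of Python. This is an illustrative reading of the example rather than the procedure of the cited authors: the stopping rule (stop once a document's lowest keyword exceeds the lowest keyword of the request) and the function name are assumptions, and the two illegible keyword entries in the example table are arbitrarily taken to be 7 here.

```python
def ordered_serial_search(file, request):
    """Serially search a file sorted by keyword numbers for documents that
    contain every keyword in `request` (an AND request).

    Returns (matching document names, number of documents examined)."""
    lowest = min(request)
    hits = []
    examined = 0
    for name, keywords in file:
        # Documents are in increasing keyword order, so once the first
        # keyword exceeds the request's lowest keyword, that keyword
        # cannot appear in any later document.
        if keywords[0] > lowest:
            break
        examined += 1
        if request.issubset(keywords):
            hits.append(name)
    return hits, examined


# The example file; the trailing 7s for documents A and D are ASSUMED,
# since those digits are illegible in the source table.
DOCUMENTS = [
    ("A", (1, 4, 7)), ("B", (2, 3, 6)), ("C", (1, 2, 3, 4)),
    ("D", (2, 5, 7)), ("E", (3, 6)), ("F", (1, 2, 5)),
]
FILE = sorted(DOCUMENTS, key=lambda doc: doc[1])   # order: C, F, A, B, D, E

print(ordered_serial_search(FILE, {1, 3}))   # request "1 AND 3"
print(ordered_serial_search(FILE, {3, 6}))   # request "3 AND 6"
```

For the request 1 AND 3 only the first three documents are examined, as in the text. A request whose lowest keyword is high, such as 3 AND 6, forces the whole file to be read, which is consistent with the small savings measured below.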

Fossum and Kaskey [48] state:

"Does  this  approach  have  any  significant 
potential  in  a  document  retrieval  application? 
Unquestionably,  it  permits  terminating  a  search 
without  examining  all  the  documents  in  the  file 
and, from this standpoint, is preferable to a
straight  document-sequenced  organization.  The 
percentage  of  the  file  records  that  can  be  by¬ 
passed,  on  the  average,  has  not  been  reported. 
In  fact,  so  far  as  known,  the  proposal  has  not 
been  tested  against  an  actual  file  of  document 
descriptions  and  a  representative  sample  of 
search  requests." 


In  the  experiments  reported  here,  such  a  file  of 
document  descriptions  and  sample  of  search  requests  were 
available.  The  forward  and  reverse  orderings  (before 
modification)  are,  in  fact,  orderings  of  the  file  based 
on  keyword  numbers. 

The 165 search requests were applied to the forward
and reverse orderings of both large files. The percentage
of the files serially searched per request is as
follows:

                              Percentage Documents Searched
File             Order        per Retrieval Request

Large Keyword    Forward      95.4
                 Reverse      84.0

Title Word       Forward      91.4
                 Reverse      85.5

The numbers only apply to individual requests as in an
on-line system. Batching of requests would eliminate any
advantages to be gained by file ordering.

No attempt was made to optimize the file ordering
or the keyword numbering in order to minimize the number
of documents searched. However, based on the above results,
it would seem that one would not be able to reduce the
percentage of documents searched below about 80% by keyword
and document ordering. This reduction is not enough to
allow the use of serial files in large-scale on-line IS&R
systems. However, small on-line systems, where there are
few enough documents to allow for serial searching, might
be able to profit from the above file organization.


CHAPTER  6 

CONCLUSIONS AND SUGGESTIONS FOR FUTURE RESEARCH


6.1  General Conclusions

On the basis of the experiments on almost 50,000
documents described in Chapter 5, it is concluded that automatic
classification can go a long way towards solving some
of the problems of large-scale information storage and retrieval.

An a posteriori automatic classification system
(CLASFY) has been described which was shown, by a number of
different measures, to be at least equal in classification
quality to a manual, a priori classification system. However,
because of its automatic and flexible nature, it is
felt that automatic classification can be vastly superior
to any manual system. This statement is made taking into
account the realization that CLASFY, while a perfectly respectable
system, can stand some improvements and is probably
far from the ultimate (if there is such a thing) in
automatic classification systems.

Regardless of the quality of classification, an
automatic classification system will only be used if the
classification time required for large files is reasonable.
It was found (see Section 5.3) that CLASFY took about 1¼
hours per tree level to classify about 50,000 document descriptions
on an IBM 7040 computer. It is expected that
this time could be reduced by at least an order of magnitude
by the use of a modern, high-speed computer (it could be
further reduced by multi-processing, a technique to which
CLASFY lends itself as a result of the independence of processing
at each node). This is because of the relatively
slow processing speed (basic cycle time = 8 microseconds,
add time = 16 microseconds) of the 7040 and, of even greater
importance, the relatively slow and limited secondary storage
facilities available (at least one fourth of the time
was spent in just copying data from disk to tape and rewinding
tapes, processes which would not have to be done if
more disk were available). In addition, the 7040 used for
these experiments was being operated on-line, i.e., all
processing stopped during printing of the node summaries
(see Figure 4-2).

Table 6-1 presents the approximate classification
times required by CLASFY to operate on the files of the examples
of Tables 1-1, 1-2, and 2-1. As seen in Table 6-1,
the time required to classify 10^6 and 10^7 documents should
be no more than 12 and 120 hours, respectively. This is
quite reasonable, especially considering that the number of
books in the Library of Congress is about 10^7. These times
represent .043 seconds per document.
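
The estimates of Table 6-1 can be cross-checked with a short calculation. The sketch below rests on assumptions read off the table rather than stated as a formula in the text: the number of classification levels is taken as log_N(Nc), the 7040 time per level is taken to scale linearly with the number of documents (25 hours per level per 10^6 documents), and a modern machine is taken to be ten times faster. The function names are hypothetical.

```python
import math


def classification_hours(n_docs, n_cells, stratification, speedup=10.0):
    """Return (levels, total 7040 hours, total hours on a faster machine)
    under the assumptions described above."""
    levels = math.log(n_cells) / math.log(stratification)   # log_N(Nc)
    per_level_7040 = 25.0 * n_docs / 1e6                     # hours per tree level
    total_7040 = levels * per_level_7040
    return levels, total_7040, total_7040 / speedup


def seconds_per_document(hours, n_docs):
    """Convert a total classification time into seconds per document."""
    return hours * 3600.0 / n_docs


for n_docs, n_cells, n in [(1e6, 1e4, 7), (1e7, 5e4, 10)]:
    levels, t7040, tfast = classification_hours(n_docs, n_cells, n)
    print(f"Nd={n_docs:.0e}  levels={levels:.2f}  7040={t7040:.0f} h  faster={tfast:.0f} h")

print(seconds_per_document(12, 1e6))   # 12 hours spread over 10^6 documents
```

Both cases give about 4.7 levels and totals slightly under the rounded table values of 120 and 1200 hours on the 7040, and 12 hours for 10^6 documents corresponds to about .043 seconds per document.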


Number of documents, Nd                      10^6        10^7

Number of cells, Nc                          10^4      5 x 10^4

Number of documents per cell, Ndc            100         200

Stratification number, N, for log_N Nc
  ≈ 4.7 (see last paragraph of
  Section 4.2.4.1)                            7           10

Average number of classification
  levels = number of levels in
  tree minus one                             4.7         4.7

Approximate 7040 time per level            25 hrs.     250 hrs.

Approximate total 7040 time required      120 hrs.    1200 hrs.

Maximum time required using modern,
  high-speed computers                     12 hrs.     120 hrs.

Table 6-1

Classification Time for 10^6 to 10^7 Documents
using CLASFY
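
The arithmetic behind Table 6-1 can be checked with a short sketch. Python is used here purely for illustration; the function name, its parameters, and the assumption that time per level grows linearly with the number of documents are not from the original experiments. The constants (25 hours per level for 10^6 documents, 4.7 levels, a tenfold speedup) are the ones tabulated above.

```python
def classification_hours(n_docs, levels=4.7,
                         hours_per_level_per_million=25.0, speedup=1.0):
    """Estimate total classification time in hours.

    Assumes (hypothetically) that the time per tree level scales
    linearly with the number of documents being classified.
    """
    per_level = hours_per_level_per_million * n_docs / 1_000_000
    return per_level * levels / speedup


# 7040 estimates for the two columns of Table 6-1
t_small = classification_hours(1_000_000)
t_large = classification_hours(10_000_000)

# Same files on a machine assumed to be ten times faster
t_small_fast = classification_hours(1_000_000, speedup=10.0)

# Seconds per document at the tabulated 12-hour bound
seconds_per_doc = 12 * 3600 / 1_000_000
```

The unrounded totals (117.5 and 1175 hours) fall just under the 120 and 1200 hours quoted in the table, and the 12-hour bound works out to about .043 seconds per document.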
6.2 Future Research

There are a number of directions for further research
in the area of automatic classification, some obvious and
some not so obvious.

One obvious direction is to improve the CLASFY
algorithm. Some means for accomplishing this were mentioned
or implied in the text of this paper. Another direction
of research, also slanted towards CLASFY, is to establish
an on-line IS&R system using CLASFY for automatic
classification. This would enable one to obtain user
reactions, particularly with respect to the quality and
utility of browsing.

Other classification systems should be designed
(some already exist), applied towards the classification
of large files such as the one used in these experiments,
and then compared on the basis of the measures described
in Section 5.2. If this could be done with reasonable
uniformity, an IS&R system designer would have a basis
upon which to select one classification system over another,
something which is lacking at present.

Another area of research is that of retrieval
statistics. It is desirable to have some idea of how many
documents will be retrieved before the actual retrieval
takes place. A user might modify a request based upon this
number. For example, if 2000 documents are estimated for
retrieval, a user would probably want to narrow the
request before the actual retrieval takes place. In an
inverted file, one can obtain complete statistics at the
expense of manipulating long lists of document numbers.
In a serial file, on the other hand, few, if any, statistics
are available.

The retrieval statistics for a file organized on
cells are somewhat in between those of the serial and
inverted files. One knows how many cells will be accessed
and how many documents are in those cells, but not how
many documents in those cells will satisfy the search
request. This can be estimated a number of different ways,
some of which are:

1) On the basis of the average number of documents
   searched per retrieval (see Table 5-2 of Section
   5.4.3).

2) On the basis of the number of documents
   retrieved from searching a sampling (maybe ten
   percent) of the cells to be searched.

3) On the basis of the number of request keywords
   found in the cells which satisfy the request.

The search for the method which achieves the best retrieval
statistics is an interesting subject for future
investigation.
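
As one illustration of the second approach, the sketch below searches a random sample of the cells and scales the hit count up to the full set of cells. It is only a sketch under stated assumptions: the function name, the representation of a cell as a list of keyword sets, the fixed random seed, and the simple proportional scaling are all hypothetical and were not part of the experiments reported here.

```python
import random


def estimate_hits(cells, matches, sample_fraction=0.1, seed=0):
    """Estimate how many documents a request will retrieve.

    cells   -- the cells the request would access; each cell is a
               list of documents (here, sets of keywords)
    matches -- predicate returning True if a document satisfies
               the request
    Only a sample of the cells is actually searched; the hit count
    is scaled by (number of cells / number of cells sampled).
    """
    if not cells:
        return 0.0
    rng = random.Random(seed)
    k = max(1, round(len(cells) * sample_fraction))
    sample = rng.sample(cells, k)
    hits = sum(1 for cell in sample for doc in cell if matches(doc))
    return hits * len(cells) / k


# Toy file: 20 identical cells, two of three documents contain "A"
cell = [{"A", "B"}, {"A"}, {"C"}]
cells = [list(cell) for _ in range(20)]
estimate = estimate_hits(cells, lambda doc: "A" in doc)
```

With identical cells the estimate is exact; with real cells its accuracy would depend on how evenly the relevant documents are spread over the cells accessed.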


APPENDIX  A 


NUCLEAR SCIENCE ABSTRACTS DATA FILES


A.1 Source of the Data

The data used in these experiments was obtained
from the Atomic Energy Commission, Division of Technical
Information Extension through the aid of Joel O'Connor,
formerly Chief, Computer Operations Branch of the above
division. The data comprises parts of two out of three
sets of data made available by the AEC upon request by
qualifying research projects. These files are in
machine-readable form on magnetic tape.

These files provide data on each document
abstracted in Nuclear Science Abstracts (NSA). It should be
noted that the three files contain different aspects of
the same documents. The three files consist of:

1) Keyword File. Document identification plus
   descriptors (called "selectors" by the AEC)
   manually indexed from the EURATOM Thesaurus
   [46,47]. Discussed in detail in Section A.2.

2) Entry File. Document identification plus
   bibliographic data such as title, a priori
   category, language, journal citation,
   contract number, etc. Discussed in detail in
   Section A.3.

3) Subject Heading File. Document identification
   plus subject headings used to index documents
   in NSA. There are about 33,000 headings [141]
   with an average of about four being used to
   index each document. This file was not used
   in these experiments and therefore will not
   be described in any further detail.

The actual files obtained for this research (on
seven magnetic tapes) are the Keyword and Entry (titles)
files of NSA Volume 22, Number 3 (February 15, 1968) -
2258 documents and Volume 21 (1967) - 47,055 documents.
The Entry file of Vol. 22, No. 3 was not used in the
experiments.

A.2 Keyword Files

Each document covered in NSA is assigned EURATOM
indexing terms by subject specialists. This is done in
order to add those documents covered by NSA to the
collection of the Center for Information and Documentation
(CID) of the European Atomic Energy Community (EURATOM).
As of September 1966, that collection held 360,000
documents and was growing at the rate of about 120,000
documents per year. Rolling [111] presents a description
of the EURATOM-CID system (batched searches on a serial
file are performed).

The EURATOM Thesaurus [46,47] is similar to other
current thesauri [58,92,137] except that the usual forms
of cross referencing (i.e., related, broader, and narrower
terms) are presented graphically. The thesaurus contains
15,695 usable terms plus 3,^68 "forbidden" terms. However,
in indexing for NSA, a number of additional terms
were used where appropriate. If deemed desirable, some of
these terms will be incorporated in future editions of
the Thesaurus. As of early 1968, 15,517 different index
terms had been assigned to NSA documents.

The keyword file [142] contains each document's
abstract number, type, assigned category (called "Section
Subsection code"), and list of keywords. The abstract
number is used to identify the documents. The types of
documents indexed are: books, theses, conference papers,
engineering materials letters, journal literature, patents,
reports, and translations. The type information was not
used in these experiments. The NSA categories are an
a priori classification of the knowledge covered by the
documents abstracted in NSA. There are almost 300
categories. This is the a priori classification referred to
in Chapters 4 and 5 (see Section 4.6 for more details).

In addition to the actual English keywords and
the NSA codes assigned to them, the keyword list also
includes "link" (called "split" in AEC literature)
information. The function of links is to group keywords
such that the keywords in a group represent topics covered
in the paper. By eliminating retrieval on conjunctions
involving keywords in different links, one reduces the
number of "false drops" in response to a query. The
wisdom of using links (and "roles") is the topic of
numerous papers [32,66,66,123,134,144], some for and some
against. The position taken here is that the utility of
links is probably reduced in a hierarchical, on-line system.
Hence, all link information was deleted in processing the
keyword files.

Due to missing data and bad tape records, not all
the documents from each file could be processed. In
addition, for reasons of speed and economy of storage,
documents with more than 47 keywords were deleted. Since
these represented only 0.16% and 0.39% of the small and large
files respectively, this should have little effect on the
experiments. The actual numbers of documents used were
2254 (out of 2258) and 46,821 (out of 47,055).

Figure A-1 presents a macro-flowchart of the
procedure used to prepare the keyword files for classification
experiments. All file processing was performed on an IBM
360A5 using PL/I. The result of this process was to
replace the English keywords with rank numbers which
corresponded to the frequency of the English word. For
example, the keyword "reactors" was replaced by its order
number, "1". Figure A-2 shows the 200 highest occurring
(large file) keywords and their frequencies of occurrence.
Now, not only can the new keywords (i.e., numbers) be
easily manipulated by computer, but keywords can be
compared on the basis of frequency by inspection. Figure
A-3 presents a sample of the alphabetic listing of English
keywords along with their corresponding frequencies and
keyword numbers. Identical, but independent, processing
was done for the small keyword file.
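
The replacement of keywords by frequency-rank numbers can be sketched as follows. The original processing was done in PL/I; this Python version is only an illustration, and the function name, the list-of-lists input format, and the alphabetical tie-breaking rule are assumptions rather than details taken from Figure A-1.

```python
from collections import Counter


def rank_keywords(documents):
    """Replace each keyword by its frequency rank (1 = most frequent).

    documents -- list of documents, each a list of keyword strings.
    Returns (rank, encoded): the keyword-to-number mapping and the
    documents rewritten as lists of keyword numbers.
    Ties are broken alphabetically (an assumption of this sketch).
    """
    freq = Counter(kw for doc in documents for kw in doc)
    ordered = sorted(freq, key=lambda kw: (-freq[kw], kw))
    rank = {kw: i for i, kw in enumerate(ordered, start=1)}
    encoded = [[rank[kw] for kw in doc] for doc in documents]
    return rank, encoded


docs = [
    ["REACTORS", "URANIUM"],
    ["REACTORS", "NEUTRONS"],
    ["REACTORS", "URANIUM"],
]
rank, encoded = rank_keywords(docs)
```

Because a smaller number always means a more frequent keyword, two keyword numbers can be compared for frequency by inspection, which is the property noted above.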

Statistics for these files can be found in Section A.4.


A.3 Entry (title word) File

A.3.1 Nature of the File

The NSA Entry File [140,142] contains each document's
abstract number, type, assigned category, title, and other
bibliographic information depending upon the type of
document. The type and category are the same as described
in the previous section. A semi-automatic procedure was
used to obtain index terms from the document titles.
However, at times, the above bibliographic material included
a "short title". A short title is composed by an
evaluator (not the author) when the original full title is not
suitable. This can occur under various circumstances, such
as a foreign language title, a title which includes a
subtitle, lengthy titles, and cases where uniform
abbreviation of words is desirable. However, since the short
title usually did not change the significant words of the
title but did simplify processing, the short title was used
whenever found. Abstracts would have been preferred (see
end of Chapter 3), but they were not available in
machine-readable form.

There were no experiments performed on the small
Entry file (henceforth called the title word file). Due to
missing data and bad tape data only 47,002 out of 47,055
documents in the large file could be processed. In
addition, after processing it was found that 60 documents
(0.13%) had no significant title words. Therefore the final
title word file contained 46,942 documents.

A.3.2 Semi-Automatic Indexing

The above document titles now had to be analyzed
to obtain significant words which can be used as keywords.
The first step is to break the titles up into individual
words. The ten break characters used for this purpose were:
(blank) . ) + - / , ( = and '. Two special cases had to
be taken care of. The first occurs when possession is
indicated such as in "COW'S MILK". It is undesirable to
break this up as "COW", "S", and "MILK" since "S" is the
abbreviation of SULFUR. Therefore, an S following an '
(apostrophe) was ignored. The second special case is
peculiar to this (and other similar) collections. Because
of the inability of most computers to recognize subscripts
and superscripts, chemical expressions such as the
composition of water (H₂O) and a strontium isotope (⁹⁰Sr)
are represented as H/SUB 2/O and /SUP 90/SR respectively.
At times this representation gets quite complex (e.g.,
/SUP 238/PUO/SUB 2/ or even 3D/SUP 10/9P/SUP 2/P/SUB /SUP
3///SUB 2//). The title partitioning program was designed
to recognize these circumstances (by recognizing "/SUB" or
"/SUP") and to consider these expressions as single words
by ignoring any break characters (such as blanks) between
slashes.
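
The partitioning rules just described can be sketched in Python. This is a minimal illustration and not the original title partitioning program: the function name, the regular expressions, and the placeholder technique are assumptions, and the sketch handles only un-nested /SUB .../ and /SUP .../ groups, not the nested form shown in the last example above.

```python
import re

# The ten break characters listed in the text
BREAK_CHARS = " .)+-/,(='"
# A subscript or superscript group such as "/SUB 2/" or "/SUP 90/"
MARKUP = re.compile(r"/SU[BP][^/]*/")
SPLITTER = re.compile("[" + re.escape(BREAK_CHARS) + "]+")
PLACEHOLDER = re.compile("\x00(\\d+)\x00")


def split_title(title):
    """Break a title into words, keeping chemical expressions whole."""
    protected = []

    def protect(match):
        # Hide the group behind a placeholder with no break characters
        protected.append(match.group(0))
        return "\x00%d\x00" % (len(protected) - 1)

    text = MARKUP.sub(protect, title.upper())
    # Ignore a possessive S so that "COW'S" does not yield "S" (sulfur)
    text = re.sub(r"'S\b", "'", text)
    words = [w for w in SPLITTER.split(text) if w]
    # Put the protected groups back into the words that contain them
    return [PLACEHOLDER.sub(lambda m: protected[int(m.group(1))], w)
            for w in words]


possessive = split_title("COW'S MILK")
chemical = split_title("Effect of H/SUB 2/O on /SUP 90/SR")
```

Protecting the slash-delimited groups before splitting is one simple way to "ignore break characters between slashes"; a character-by-character scanner that tracks whether it is inside a group would be an equally valid design.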

Table A-1 presents the steps involved in
processing the title word file including the total number of
terms (words or word stems) and the number of discrete
terms found at each stage of processing. A number of these
steps were combined in the actual processing but, for
simplicity, are shown separately in the table.

A stop list was formed to eliminate some of the
common words which could not be used as keywords. This
list was obtained by considering other available stop lists
[112] and by noting the most common words in the first 900
documents of this collection. A comparison of the twelve
most often occurring common words with the top twelve
words found in a recent analysis of 1,014,232 words of
running text (broad cross-section of subjects) is shown
in Table A-2. It should be noted that the top 12 out of
169 stop words accounted for 79 percent of the deleted
title words.
Function Performed                  Number of        Total number
(from top down)                     discrete terms   of terms

Break titles into words                 20,336          431,022

Delete words on stop list               -  180         -135,533
                                        ------          -------
                                        20,156          295,489

Eliminate duplicate words for
  each document                                        -  3,560
                                                        -------
                                                        291,929

Reduce terms by stem analysis           -3,023
                                        ------
                                        17,132

Eliminate duplicate stems
  for each document                                    -    210
                                                        -------
                                                        291,719

Delete stems on new stop list           -1,755         - 14,380
                                        ------          -------
                                        15,377          277,339

Reduce stems by simulated
  synonym dictionary                    -2,068

Eliminate duplicate stems
  for each document                                    -    198
                                        ------          -------
Final Totals                            13,309          277,141

Table A-1

Steps Involved in Processing Title Word File


Table A-2

Common Word Comparison

The third item of Table A-1 (it appears twice more
in the table) eliminates duplicate words for each document.
For example, if the last sentence of the previous
paragraph was processed, one occurrence of the word "words"
would have been deleted (all occurrences of "the" and "of"
would have been deleted by the stop list).

In order to normalize the vocabulary to some extent,
a stem analysis program was written. This program, a
modified version of one used in the General Inquirer [131,
132], removes a number of different suffixes. Project
SMART uses a table look-up procedure which, while more
effective, seems to take considerably more computer time
[27,81,119]. A flowchart of the stem analysis program is
shown in Figure A-4. The operation of this program removed
enough suffixes to cause a reduction of 3,023 in the number
of discrete title words. Suffixes removed by this program
are: s, e, es, ed, ing, ings, ion, ions, ly, edly, ingly,
plus a doubled letter immediately followed by ed or ing.
In addition, ies, ied and ily are replaced by the single
letter y. However, in order to prevent the shortening or
complete disappearance of short words (e.g., ion, gas, bee,
wing, etc.), word length is not reduced below three letters.
As an example of suffix removal, consider the following
actual title words (frequencies of occurrence are
parenthesized):

    ESTIMATE (11)
    ESTIMATED (5)
    ESTIMATES (19)
    ESTIMATING (27)
    ESTIMATION (58)
    ESTIMATIONS (1)

These words were all reduced to the single stem, ESTIMAT
(121).
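
A suffix remover following the rules listed above can be sketched in Python. This is not the program of Figure A-4, whose exact logic is not reproduced here: the function name, the order in which the rules are tried, the longest-suffix-first matching, and the single pass over each word are all assumptions of this sketch.

```python
# Suffixes named in the text, tried longest first
SUFFIXES = sorted(
    ["S", "E", "ES", "ED", "ING", "INGS", "ION", "IONS",
     "LY", "EDLY", "INGLY"],
    key=len, reverse=True)
Y_SUFFIXES = ("IES", "IED", "ILY")   # replaced by the single letter Y
MIN_LEN = 3                          # never shorten below three letters


def stem(word):
    """Return the stem of a title word (one pass, first rule that fits)."""
    w = word.upper()
    # IES, IED, ILY -> Y
    for suffix in Y_SUFFIXES:
        if w.endswith(suffix) and len(w) - len(suffix) + 1 >= MIN_LEN:
            return w[:-len(suffix)] + "Y"
    # A doubled letter immediately followed by ED or ING
    for suffix in ("ED", "ING"):
        base = w[:-len(suffix)]
        if (w.endswith(suffix) and len(base) >= MIN_LEN + 1
                and base[-1] == base[-2]):
            return base[:-1]
    # Plain suffix removal
    for suffix in SUFFIXES:
        if w.endswith(suffix) and len(w) - len(suffix) >= MIN_LEN:
            return w[:-len(suffix)]
    return w


frequencies = {"ESTIMATE": 11, "ESTIMATED": 5, "ESTIMATES": 19,
               "ESTIMATING": 27, "ESTIMATION": 58, "ESTIMATIONS": 1}
stems = {stem(word) for word in frequencies}
total = sum(frequencies.values())
```

Under these assumptions the six title words collapse to the single stem ESTIMAT with a combined frequency of 121, while short words such as ION, GAS, BEE, and WING are left untouched by the three-letter minimum.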

The final steps in the process of obtaining
keyword stems were aided by manual intervention. A new stop
list was formed and a synonym dictionary was simulated
after inspection of the list of 17,132 stems (see Table A-1).
In an actual operating system, this need be done only once
with few additions as the collection grows. About 1000
of the 1755 items on the new stop list were numbers. Many
of the synonyms involved suffix deletions or changes and
could have been incorporated in a more complete suffix
removal program (e.g., combination of electrolytical with
electrolytic). Others involved spelling errors and
combination of British and American forms of the same words.
Still others involved the combination of abbreviated forms
and non-abbreviated forms of the same word or combination
of chemical and English terms (e.g., converting H/SUB 2/O
and 2H/SUB 2/O to WATER).

Once the keyword stems were obtained, they were
converted to numbers in the same manner as was done for the
keywords of the keyword file (Figure A-1). The 200 highest
occurring word stems are shown in Figure A-5 (compare with
those of the keyword file, Figure A-2). Full statistics
for the title word file can be found in the next section.

A.4 File Statistics

Pertinent statistics for the three files studied are
presented in Table A-3. One is reminded that the
documents of the large keyword file and the title word file are
essentially the same, both belonging to a collection of
47,055 documents. It should be noted that, as expected,
the proportion of keywords with a single occurrence decreased
from the small to the large keyword file. Also as expected,
the proportion of unique keywords was highest for the title
word file.

In Figure A-6 keyword frequency is plotted against
rank (i.e., keyword order number). Zipf [155] found that
when this curve was plotted for words of running text, a
straight line resulted (Zipf's Law). However, as can be
seen in Figure A-6, this does not hold for document index
words. Houston and Wall [60] found, however, that when
term frequency was plotted on logarithmic probability paper,
a linear relationship was found to exist up to about the
95th percentile. The fact that this log-normal
relationship holds for the files under consideration is shown
in Figure A-7.

Based on ten systems, all of which follow this
log-normal relationship, Houston and Wall went on to develop
an expression relating vocabulary size to the total number
of keyword occurrences. This formula is:

    Nv = 3330 log (K + 10000) - 12600

where the total number of keyword occurrences, K = Nd x
Nkd. Applying this formula to the files under study
results in

                      predicted     actual

    small keyword        2250        2557

    large keyword        6250        8044

    title word           5550       13309.

The failure of the equation for the title word file is due
to the fact that the equation was based on and seems to
be only applicable to manual indexing systems which allow
for vocabulary growth. A major reason for the actual
vocabulary size being so much larger than the predicted
size for the title word file is the large number of unique
terms.
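
The Houston and Wall expression is easy to evaluate directly. The Python sketch below assumes that "log" denotes the base-10 logarithm, an assumption that is consistent with the predicted values quoted above; the function name is hypothetical. The value of K used for the title word file, 277,141, is the final total number of terms reported in Table A-1.

```python
import math


def houston_wall_vocabulary(k):
    """Predicted vocabulary size for k total keyword occurrences.

    Assumes the logarithm in the published formula is base 10.
    """
    return 3330 * math.log10(k + 10000) - 12600


# Title word file: K taken from the final total of Table A-1
predicted_title = houston_wall_vocabulary(277141)
actual_title = 13309
ratio = actual_title / predicted_title
```

For the title word file the formula gives a value close to the tabulated 5550, so the actual vocabulary is more than twice the prediction.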

The distributions of the keywords per document
for the three files are shown in Figure A-8.


APPENDIX B


DOCUMENT RETRIEVALS

B.1 Retrieval Requests

In order to test the ability of classification
systems to group similar documents into cells (categories),
actual retrievals based on real search requests were
performed (see Chapter 5). These requests were obtained
from Gloria Smith of the Lawrence Radiation Laboratories.

The requests originally consisted of conjunctions,
disjunctions, and negations of EURATOM keywords and NSA
categories. In all, there were 177 requests from 24
nuclear physicists or groups of nuclear physicists. These
requests are in active use at Lawrence Radiation
Laboratories, being serially matched against each semimonthly
issue of the NSA Keyword File (see Appendix A). However,
because of the expense of performing retrospective searches
on serial files, these requests have not been used for
retrospective searches.

Because these experiments were aimed at producing
classification systems, the NSA categories could not be
used for retrieval. In addition, the negations were also
not used because (a) the types of statistics desired from
the retrieval experiments would have been clouded by the
use of negation and (b) the utility of negation would be
lessened through the use of an on-line system which
permitted browsing. In order to eliminate the above request
items while retaining the original meanings of the
requests, some requests had to be altered somewhat and twelve
of them dropped completely. This left 165 retrieval
requests.

Since the requests were in terms of EURATOM
keywords, no additional modifications were necessary to
apply them to the keyword files. However, translation
into word stems was required for the title word file.
Where possible, this was done on a one-to-one basis;
however, the aim was to maintain the meaning of the requests
and not necessarily their exact forms.

Each request was of the form

    (A1 v A2 v ... An) & (B1 v B2 v ... Bm) & (C1 v C2 v ... Cp)
    v (D & E & F) v (G & H & I) v (J & K & L)

where v stands for logical OR, & stands for logical AND,
A - L represent keywords, and n, m, and p are integers.
The only essential part of this expression is A1. The
integers n, m, and p can take on any values, but the highest
encountered in the 165 requests was 39.

Some examples of typical requests are given in
Figure B-1. Statistics for the 165 requests were tabulated
and are presented in Table B-1. The terms used are defined
in the following examples. A request of (A v B) & (C v D)
& E has 4 (= 2 x 2 x 1) three conjunct conjunctions, 1
three conjunct expression, and 5 three conjunct tokens.
A request of (A v B) v (C & D) has 2 (= 2) one conjunct
conjunctions, 1 one conjunct expression, 2 one conjunct
tokens, 1 (= 1 x 1) two conjunct conjunction, 1 two
conjunct expression, and 2 two conjunct tokens.

                      Conjunctions   Expressions   Tokens

    One conjunct           190            60         190

    Two conjunct          1930           111         765

    Three conjunct        2728*           22         262

    Totals                4848           193        1244

    Average per
    Question (÷165)       29.4           1.2         7.5

    *2280 of these are the result of just three
    expressions, the largest being 39 x 6 x 5 = 1170

Table B-1

Request Statistics
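
The three quantities defined by these examples can be computed mechanically. In the Python sketch below a request is represented, as an assumption of this illustration, by a list of expressions joined by OR, each expression being a list of groups joined by AND, and each group a list of keywords joined by OR. The function name and this data layout are hypothetical.

```python
from collections import defaultdict
from math import prod


def request_statistics(request):
    """Tabulate conjunctions, expressions, and tokens by conjunct count.

    An expression with k groups is a "k conjunct" expression. Its
    number of conjunctions is the product of its group sizes and its
    number of tokens is the sum of its group sizes.
    """
    stats = defaultdict(
        lambda: {"conjunctions": 0, "expressions": 0, "tokens": 0})
    for expression in request:
        sizes = [len(group) for group in expression]
        row = stats[len(expression)]
        row["conjunctions"] += prod(sizes)
        row["expressions"] += 1
        row["tokens"] += sum(sizes)
    return dict(stats)


# (A v B) & (C v D) & E
first = request_statistics([[["A", "B"], ["C", "D"], ["E"]]])
# (A v B) v (C & D)
second = request_statistics([[["A", "B"]], [["C"], ["D"]]])
```

The same product rule accounts for the largest entry noted under Table B-1, where groups of 39, 6, and 5 keywords give 39 x 6 x 5 = 1170 conjunctions.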

B.2 Documents Retrieved

These requests were applied to the three files
numerous times during the course of the experiments.
Table B-2 shows the number of retrievals (the document
abstract numbers were actually retrieved) and the number
of documents they represent for each of the files.
Because of the different indexes in the two large files,
the number of documents retrieved was substantially
different even though the files consist of essentially the
same documents.

The discrepancy between the number of retrievals
and the actual number of documents retrieved is due to the
fact that some documents were retrieved in response to more
than one request. The distributions of the number of
times each document was retrieved are shown in Figure B-2.
It was found that these distributions approximate straight
lines (semi-log paper). It should be noted that the most
"popular" document (large keyword file) was retrieved in
response to 22 requests.
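
The distinction between retrievals and distinct documents retrieved can be made concrete with a small Python sketch. The function name, the request identifiers, and the abstract numbers below are invented for illustration and are not taken from Table B-2 or Figure B-2.

```python
from collections import Counter


def retrieval_summary(results):
    """Summarize retrievals over a set of requests.

    results -- mapping from request id to the abstract numbers
               retrieved for that request.
    Returns (retrievals, distinct, distribution), where distribution
    maps "times retrieved" to the number of documents retrieved
    that many times.
    """
    times = Counter(doc for docs in results.values() for doc in set(docs))
    retrievals = sum(times.values())
    distinct = len(times)
    distribution = dict(Counter(times.values()))
    return retrievals, distinct, distribution


# Invented example: document 102 satisfies all three requests
results = {"Q1": [101, 102], "Q2": [102, 103], "Q3": [102]}
retrievals, distinct, distribution = retrieval_summary(results)
```

In this toy case five retrievals represent only three documents, and the distribution is the kind of tabulation that would be plotted on semi-log paper.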